Unverified commit 97c241f3, authored by K kirayummy, committed by GitHub

Merge pull request #2 from PaddlePaddle/master

merge
......@@ -23,7 +23,7 @@ repos:
    sha: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
    hooks:
    - id: check-added-large-files
      args: [--maxkb=4096]
    - id: check-merge-conflict
    - id: check-symlinks
    - id: detect-private-key
......
# Paddle Graph Learning (PGL)
<img src="./docs/source/_static/logo.png" alt="The logo of Paddle Graph Learning (PGL)" width="320">
[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/quick_start/instruction.html) | [中文](./README.zh.md)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
<img src="./docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
We provide Python interfaces for storing, reading and querying graph-structured data, together with two fundamental computational interfaces (a walk-based paradigm and a message-passing-based paradigm, as shown in the PGL framework above) for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
The newly released PGL supports heterogeneous graph learning on both the walk-based paradigm and the message-passing-based paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and some distributed training algorithms, such as distributed deepwalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. In PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help build custom graph neural networks easily. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) to send the message from the source to the target node. In the second step, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) is responsible for aggregating messages ![](http://latex.codecogs.com/gif.latex?\\oplus) together from different sources.
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown on the left of the following figure, to adapt to general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies an aggregate function ![](http://latex.codecogs.com/gif.latex?\\oplus) to each batch serially. For our PGL UDF aggregate function, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
Users only need to call the ```sequence_ops``` functions provided by Paddle to implement efficient message aggregation. For example, the snippet below uses ```sequence_pool``` to sum the neighbor messages.
......@@ -33,14 +33,14 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
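```python
import paddle.fluid as fluid
def recv(msg):
    return fluid.layers.sequence_pool(msg, "sum")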
```
DGL does provide kernel fusion optimization with scatter-gather for general aggregate functions such as sum and max. But for **complex user-defined functions** handled with the degree bucketing algorithm, the serial execution over each degree bucket cannot take full advantage of the performance offered by GPUs. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL reaches up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we also provide built-in scatter-optimized message aggregation functions.
### Performance
We test all the following GNN algorithms with a Tesla V100-SXM2-16G, running 200 epochs to obtain average speeds, and we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
......@@ -49,12 +49,64 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message passing in DGL is forced to degenerate into the degree bucketing scheme, and it becomes much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL reaches up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
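As a concrete illustration, below is a minimal sketch of such a complex user-defined aggregator: it runs an LSTM over the variable-length neighbor messages (a LodTensor). The `hidden_size` value and the projection layer are illustrative assumptions, not part of the benchmark code above.

```python
import paddle.fluid as fluid

def lstm_recv(msg):
    # illustrative complex UDF: sequence-aggregate neighbor messages with
    # an LSTM, then keep the last hidden state per destination node
    hidden_size = 16  # assumed dimension, for illustration only
    # dynamic_lstm expects an input width of 4 * hidden_size
    proj = fluid.layers.fc(msg, size=hidden_size * 4, act=None)
    lstm_out, _ = fluid.layers.dynamic_lstm(input=proj, size=hidden_size * 4)
    return fluid.layers.sequence_pool(lstm_out, "last")
```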
## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
Graphs conveniently represent relations between real-world entities, but the categories of entities and the relations between them vary widely. Therefore, in a heterogeneous graph we need to distinguish node types and edge types. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support meta path walk sampling on heterogeneous graph
<img src="./docs/source/_static/metapath_sampling.png" alt="The metapath sampling in heterogeneous graph" width="800">
The left side of the figure above depicts a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side shows a simple MetaPath sampling process: given the MetaPath UPU (user-product-user), sampling yields results like the following
<img src="./docs/source/_static/metapath_result.png" alt="The metapath result" width="320">
On this basis, word2vec and related methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
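To make the sampling concrete, here is a minimal, framework-free sketch of a meta-path-based random walk; the dictionaries `neighbors` and `node_types` are illustrative stand-ins for PGL's heterogeneous graph storage, not PGL API.

```python
import random

def metapath_walk(neighbors, node_types, start, metapath, walk_length):
    # metapath is a list such as ['user', 'product', 'user'] whose first
    # and last types match, so the required types cycle through metapath[1:]
    assert node_types[start] == metapath[0]
    walk = [start]
    pattern = metapath[1:]
    step = 0
    while len(walk) < walk_length:
        # keep only neighbors whose type matches the next type in the path
        candidates = [n for n in neighbors[walk[-1]]
                      if node_types[n] == pattern[step % len(pattern)]]
        if not candidates:
            break
        walk.append(random.choice(candidates))
        step += 1
    return walk
```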
### Support Message Passing mechanism on heterogeneous graph
<img src="./docs/source/_static/him_message_passing.png" alt="The message passing mechanism on heterogeneous graph" width="800">
Because a heterogeneous graph has different node types, message passing differs as well. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, neighbors of different types are aggregated separately during message passing and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
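A minimal sketch of this per-type aggregate-then-merge scheme, adapted from the heterogeneous graph tutorial later in this document (`gw` is a `HeterGraphWrapper` indexed by edge type):

```python
import paddle.fluid as fluid

def hetero_message_passing(gw, edge_types, features):
    def __message(src_feat, dst_feat, edge_feat):
        return src_feat['h']

    def __reduce(feat):
        return fluid.layers.sequence_pool(feat, pool_type='sum')

    outputs = []
    for etype, feat in zip(edge_types, features):
        # aggregate messages separately for every edge type
        msg = gw[etype].send(__message, nfeat_list=[('h', feat)])
        outputs.append(gw[etype].recv(msg, __reduce))
    # merge the per-type aggregations into the final message
    return fluid.layers.concat(outputs, axis=1)
```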
## Large-Scale: Support distributed graph storage and distributed training algorithms
In most cases of large-scale graph learning, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so a large-scale distributed training algorithm can easily be set up on MPI clusters.
<img src="./docs/source/_static/distributed_frame.png" alt="The distributed frame of PGL" width="800">
## Model Zoo
The following are 13 graph learning models that have been implemented in the framework. See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#highlight-tons-of-models)
|Model | feature |
|---|---|
| GCN | Graph Convolutional Neural Networks |
| GAT | Graph Attention Network |
| GraphSage |Large-scale graph convolution network based on neighborhood sampling|
| unSup-GraphSage | Unsupervised GraphSAGE |
| LINE | Representation learning based on first-order and second-order neighbors |
| DeepWalk | Representation learning by DFS random walk |
| MetaPath2Vec | Representation learning based on metapath |
| Node2Vec | Representation learning combining DFS and BFS |
| Struct2Vec | Representation learning based on structural similarity |
| SGC | Simplified graph convolution neural network |
| GES | Graph representation learning with node features |
| DGI | Unsupervised representation learning based on graph convolution network |
| GATNE | Heterogeneous graph representation learning based on message passing |
The models above fall into three groups: graph representation learning, graph neural networks, and heterogeneous graph learning, where the heterogeneous graph models themselves again split into graph representation learning and graph neural networks.
## System requirements
PGL requires:
* paddle >= 1.6
* cython
......@@ -63,15 +115,18 @@ PGL supports both Python 2 & 3
## Installation
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by NLP and Paddle Teams at Baidu
E-mail: nlp-gnn[at]baidu.com
## License
PGL uses Apache License 2.0.
<img src="./docs/source/_static/logo.png" alt="The logo of Paddle Graph Learning (PGL)" width="320">
[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/quick_start/instruction.html) | [English](./README.md)
Paddle Graph Learning (PGL) is an efficient and easy-to-use graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<img src="./docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
The newly released PGL introduces support for heterogeneous graphs: MetaPath sampling for heterogeneous graph representation learning, and a heterogeneous-graph Message Passing mechanism for message-passing-based heterogeneous graph algorithms. With these new heterogeneous graph interfaces, cutting-edge heterogeneous graph learning algorithms can be built easily. Moreover, the newly released PGL also adds distributed graph storage and several distributed training algorithms, such as distributed deepwalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, our framework covers most graph-network applications, including graph representation learning and graph neural networks.
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
Compared with other models, the biggest advantage of graph neural networks is that they exploit the connectivity information between nodes; however, modeling these node connections in code is quite cumbersome. PGL adopts a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) as the interface for building graph neural networks. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the figure below, the send function is first defined on the edges of the graph, and the user-defined send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) sends the message from the source to the target node. Then the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) gathers these messages with an aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus).
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown in the left figure below, to adapt to user-defined aggregation functions, DGL uses degree bucketing to group nodes of the same degree into blocks and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus) to each block. For PGL's user-defined aggregation functions, we instead treat the messages as a PaddlePaddle [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html), regarding them as a batch of variable-length sequences, and **exploit the features of LodTensor in PaddlePaddle for fast parallel message aggregation**.
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
Users only need to call Paddle's sequence functions ```sequence_ops``` to implement efficient message aggregation. For example, the snippet below simply uses ```sequence_pool``` to sum the neighbor messages.
```python
import paddle.fluid as fluid
def recv(msg):
    return fluid.layers.sequence_pool(msg, "sum")
```
Although DGL uses kernel fusion to optimize common aggregation functions such as sum and max with scatter-gather, for **complex user-defined functions** its degree bucketing algorithm processes the buckets serially and does not make full use of the GPU. In PGL, LodTensor-based message passing fully exploits GPU parallelism; with complex user-defined functions, PGL can even reach 13 times the speed of DGL in our experiments. Even without the scatter-gather optimization, PGL still performs efficiently. Of course, we also provide scatter-optimized aggregation functions.
### Performance
We tested all the following GNN algorithms with a Tesla V100-SXM2-16G, running each algorithm for 200 epochs to compute the average speed. Accuracy is computed on the test set, and we did not use early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
| Pubmed | GCN |79.2% |**0.0049s** |0.0051s |
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation function, for example one like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with an LSTM while ignoring the order in which neighbor messages arrive, the message-passing functions used by DGL degenerate into degree bucketing mode, and in this situation the model implemented in DGL is much slower than in PGL. Performance varies with graph scale; in our experiments PGL can even reach 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up |
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
Graphs conveniently represent relations between real-world entities, but the categories of entities and the relations between them vary widely. Therefore, in a heterogeneous graph we need to distinguish node types and edge types in the graph network. PGL models heterogeneous graphs containing multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support MetaPath walk sampling on heterogeneous graphs
<img src="./docs/source/_static/metapath_sampling.png" alt="The metapath sampling in heterogeneous graph" width="800">
The left side of the figure above depicts a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side shows a simple MetaPath sampling process: given the input metapath UPU (user-product-user), the sampled result is
<img src="./docs/source/_static/metapath_result.png" alt="The metapath result" width="320">
On this basis, word2vec and related methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
### Support a Message Passing mechanism on heterogeneous graphs
<img src="./docs/source/_static/him_message_passing.png" alt="The message passing mechanism on heterogeneous graph" width="800">
Because a heterogeneous graph has different node types, message passing differs as well. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, neighbors of different types are aggregated separately during message passing and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
## Large-Scale: Support distributed graph storage and distributed training algorithms
In large-scale graph learning, multi-machine graph storage and distributed training are usually required. As shown in the figure below, PGL provides a solution for large-scale training: we use [PaddleFleet](https://github.com/PaddlePaddle/Fleet) (which supports large-scale distributed embedding learning) as our parameter server module, together with a simple distributed storage scheme, so distributed large-scale graph learning can easily be set up on MPI clusters.
<img src="./docs/source/_static/distributed_frame.png" alt="The distributed frame of PGL" width="800">
## Model Zoo - Covering Most Graph Learning Models in the Industry
The following thirteen graph learning models are already implemented in the framework. See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#highlight-tons-of-models)
| Model | Feature |
|---|---|
| GCN | Graph convolutional network |
| GAT | Attention-based graph convolutional network |
| GraphSage | Large-scale graph convolutional network based on neighbor sampling |
| unSup-GraphSage | Unsupervised GraphSAGE |
| LINE | Representation learning based on first-order and second-order neighbors |
| DeepWalk | Representation learning by DFS random walk |
| MetaPath2Vec | Representation learning based on metapath |
| Node2Vec | Representation learning combining DFS and BFS |
| Struct2Vec | Representation learning based on structural similarity |
| SGC | Simplified graph convolutional network |
| GES | Graph representation learning with node features |
| DGI | Unsupervised representation learning based on graph convolutional networks |
| GATNE | Heterogeneous graph representation learning based on message passing |
The models above cover graph representation learning, graph neural networks, and heterogeneous graphs, and the heterogeneous graph part again splits into graph representation learning and graph neural networks.
## Dependencies
PGL relies on:
* paddle >= 1.6
* cython
PGL supports both Python 2 and 3.
## Installation
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by the NLP and Paddle teams at Baidu.
Contact E-mail: nlp-gnn[at]baidu.com
## License
PGL uses Apache License 2.0.
sphinx==2.1.0
mistune
sphinx_rtd_theme
numpy >= 1.16.4
cython >= 0.25.2
paddlepaddle
pgl
pgl.contrib.heter\_graph module: Heterogeneous Graph Storage
=============================================================
.. automodule:: pgl.contrib.heter_graph
   :members:
   :undoc-members:
   :show-inheritance:
pgl.contrib.heter\_graph\_wrapper module: Heterogeneous Graph data holders for Paddle GNN.
===========================================================================================
.. automodule:: pgl.contrib.heter_graph_wrapper
   :members:
   :undoc-members:
   :show-inheritance:
......@@ -8,3 +8,6 @@ API Reference
   pgl.layers
   pgl.data_loader
   pgl.utils.paddle_helper
   pgl.utils.mp_reader
   pgl.contrib.heter_graph
   pgl.contrib.heter_graph_wrapper
pgl.utils.mp\_reader module: MultiProcessing reader helper function for Paddle.
================================================================================
.. automodule:: pgl.utils.mp_reader
   :members:
   :undoc-members:
   :show-inheritance:
......@@ -40,7 +40,7 @@ copyright = '2019, PaddlePaddle'
author = 'PaddlePaddle'
# The full version, including alpha/beta/rc tags
release = '1.0.1'
# -- General configuration ---------------------------------------------------
......@@ -73,13 +73,12 @@ lanaguage = "zh_cn"
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_show_sourcelink = False
html_logo = '_static/logo.png'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_theme_options = {
    'canonical_url': '',
    'analytics_id': 'UA-XXXXXXX-1',  # Provided by Google in your dashboard
......@@ -96,4 +95,3 @@ html_theme_options = {
    'includehidden': True,
    'titles_only': False
}
.. mdinclude:: ../../../examples/dgi/README.md
.. mdinclude:: ../../../examples/distribute_deepwalk/README.md
.. mdinclude:: ../../../examples/distribute_graphsage/README.md
.. mdinclude:: ../../../examples/gat/README.md
View the Code
=============
examples/gat/train.py
.. literalinclude:: ../../../examples/gat/train.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/GATNE/README.md
.. mdinclude:: ../../../examples/gcn/README.md
View the Code
=============
examples/gcn/train.py
.. literalinclude:: ../../../examples/gcn/train.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/ges/README.md
.. mdinclude:: ../../../examples/graphsage/README.md
View the Code
=============
examples/graphsage/train.py
.. literalinclude:: ../../../examples/graphsage/train.py
   :language: python
   :linenos:
examples/graphsage/reader.py
.. literalinclude:: ../../../examples/graphsage/reader.py
   :language: python
   :linenos:
examples/graphsage/model.py
.. literalinclude:: ../../../examples/graphsage/model.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/line/README.md
# Building Graph Attention Networks
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
### Simple example to build single head GAT
To build a GAT layer, one can use our pre-defined ```pgl.layers.gat``` or just write a GAT layer with the message passing interface.
```python
import paddle.fluid as fluid

def gat_layer(graph_wrapper, node_feature, hidden_size):
    def send_func(src_feat, dst_feat, edge_feat):
        # attention logits from the source and target projections
        logits = src_feat["a1"] + dst_feat["a2"]
        logits = fluid.layers.leaky_relu(logits, alpha=0.2)
        return {"logits": logits, "h": src_feat["h"]}

    def recv_func(msg):
        # softmax-normalize attention over each node's neighbors,
        # weight the messages and sum them up
        norm = fluid.layers.sequence_softmax(msg["logits"])
        output = msg["h"] * norm
        output = fluid.layers.sequence_pool(output, "sum")
        return output

    h = fluid.layers.fc(node_feature, hidden_size, bias_attr=False, name="hidden")
    a1 = fluid.layers.fc(node_feature, 1, name="a1_weight")
    a2 = fluid.layers.fc(node_feature, 1, name="a2_weight")
    message = graph_wrapper.send(send_func,
                                 nfeat_list=[("h", h), ("a1", a1), ("a2", a2)])
    output = graph_wrapper.recv(recv_func, message)
    return output
```
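For comparison, the pre-defined layer mentioned above can also be called directly. The sketch below is hypothetical usage only; the exact keyword arguments may differ across PGL versions, so check the ```pgl.layers``` API.

```python
import pgl

# hypothetical call to the pre-defined GAT layer; argument names
# are indicative only
output = pgl.layers.gat(graph_wrapper, node_feature,
                        hidden_size=8, activation="elu", name="gat_layer")
```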
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (faster with 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275s | 0.0253s |
### How to run
For example, use GPU to train GAT on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](gat_examples_code.html)
# Building Graph Convolutional Network
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce the GCN algorithm and reach the same level of indicators as the paper on citation network benchmarks.
### Simple example to build GCN
To build a GCN layer, one can use our pre-defined ```pgl.layers.gcn``` or just write a GCN layer with the message passing interface.
```python
import paddle.fluid as fluid

def gcn_layer(graph_wrapper, node_feature, hidden_size, act):
    def send_func(src_feat, dst_feat, edge_feat):
        # the message is simply the source node feature
        return src_feat["h"]

    def recv_func(msg):
        # sum the messages from all neighbors
        return fluid.layers.sequence_pool(msg, "sum")

    message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)])
    output = graph_wrapper.recv(recv_func, message)
    output = fluid.layers.fc(output, size=hidden_size, act=act)
    return output
```
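Alternatively, the pre-defined ```pgl.layers.gcn``` mentioned above can be used directly. The sketch below is hypothetical usage only; keyword names are indicative and may differ across PGL versions.

```python
import pgl

# hypothetical call to the pre-defined GCN layer; argument names
# are indicative only
output = pgl.layers.gcn(graph_wrapper, node_feature,
                        hidden_size=16, activation="relu", name="gcn_layer_1")
```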
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (faster with 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
### How to run
For example, use GPU to train GCN on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](gcn_examples_code.html)
# Graph Representation Learning: Node2vec
[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the node2vec algorithm and reach the same level of indicators as the paper.
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- pgl
## How to run
For example, use GPU to train node2vec on the BlogCatalog dataset.
```sh
# multiclass task example
python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_BlogCatalog/ --offline_learning --epoch 400
python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000
# link prediction task example
python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 10
python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400
```
## Hyperparameters
- dataset: The dataset, "BlogCatalog" or "ArXiv".
- use_cuda: Use GPU if use_cuda is set.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|deepwalk|multi-label classification|MacroF1|0.250|0.211
BlogCatalog|node2vec|multi-label classification|MacroF1|0.262|0.258
ArXiv|deepwalk|link prediction|AUC|0.9538|0.9340
ArXiv|node2vec|link prediction|AUC|0.9541|0.9366
## View the Code
See the code [here](node2vec_examples_code.html)
# StaticGraphWrapper for GAT Speed Optimization
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
However, different from the reproduction in **examples/gat**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which achieves better speed.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (faster with 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat speed <br> (epoch time) | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
### How to run
For example, use GPU to train GAT on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](static_gat_examples_code.html)
# StaticGraphWrapper for GCN Speed Optimization
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce the GCN algorithm and reach the same level of indicators as the paper on citation network benchmarks.
However, different from the reproduction in **examples/gcn**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which achieves better speed.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (faster with 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn speed <br> (epoch time) | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
### How to run
For example, use GPU to train GCN on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](static_gcn_examples_code.html)
.. mdinclude:: ../../../examples/metapath2vec/README.md
.. mdinclude:: ../../../examples/node2vec/README.md
View the Code
=============
examples/node2vec/node2vec.py
.. literalinclude:: ../../../examples/node2vec/node2vec.py
   :language: python
   :linenos:
examples/node2vec/multi_class.py
.. literalinclude:: ../../../examples/node2vec/multi_class.py
   :language: python
   :linenos:
examples/node2vec/link_predict.py
.. literalinclude:: ../../../examples/node2vec/link_predict.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/sgc/README.md
.. mdinclude:: ../../../examples/static_gat/README.md
View the Code
=============
examples/static_gat/train.py
.. literalinclude:: ../../../examples/static_gat/train.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/static_gcn/README.md
View the Code
=============
examples/static_gcn/train.py
.. literalinclude:: ../../../examples/static_gcn/train.py
   :language: python
   :linenos:
.. mdinclude:: ../../../examples/strucvec/README.md
.. mdinclude:: ../../../examples/unsup_graphsage/README.md
......@@ -15,14 +15,9 @@ Quick Start
.. toctree::
   :maxdepth: 1
   :caption: Quick Start

   quick_start/instruction.rst
   quick_start/introduction_for_hetergraph.rst
.. toctree::
......@@ -34,7 +29,16 @@ See instruction_ for quick start.
   examples/static_graph_wrapper.rst
   examples/node2vec_examples.rst
   examples/graphsage_examples.rst
   examples/dgi_examples.rst
   examples/distribute_deepwalk_examples.rst
   examples/distribute_graphsage_examples.rst
   examples/ges_examples.rst
   examples/line_examples.rst
   examples/sgc_examples.rst
   examples/strucvec_examples.rst
   examples/gatne_examples.rst
   examples/metapath2vec_examples.rst
   examples/unsup_graphsage_examples.rst
.. toctree::
   :maxdepth: 2
......
# Paddle Graph Learning (PGL)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<div />
......@@ -9,17 +8,18 @@ Paddle Graph Learning (PGL) is an efficient and flexible graph learning framewor
<center>The Framework of Paddle Graph Learning (PGL)</center>
<div />
We provide Python interfaces for storing, reading and querying graph-structured data, together with two fundamental computational interfaces (a walk-based paradigm and a message-passing-based paradigm, as shown in the PGL framework above) for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
The newly released PGL supports heterogeneous graph learning on both the walk-based paradigm and the message-passing-based paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and some distributed training algorithms, such as distributed deepwalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
## Highlight: Efficient and Flexible <br/> Message Passing Paradigm
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. In PGL we adopt a **Message Passing Paradigm** similar to DGL to help build custom graph neural networks easily. Users only need to write ``send`` and ``recv`` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function $\phi^e$ to send the message from the source to the target node. In the second step, the recv function $\phi^v$ is responsible for aggregating messages $\oplus$ together from different sources.
<div />
<div align=center><img src="_static/message_passing_paradigm.png" width="700"></div>
<center>The basic idea of message passing paradigm</center>
<div />
As shown on the left of the following figure, to adapt to general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies an aggregate function $\oplus$ to each batch serially. For our PGL UDF aggregate function, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<div/>
......@@ -28,21 +28,22 @@ As shown in the left of the following figure, to adapt general user-defined mess
<div/>
Users only need to call the ``sequence_ops`` functions provided by Paddle to implement efficient message aggregation. For example, the snippet below uses ``sequence_pool`` to sum the neighbor messages.
```python
import paddle.fluid as fluid
def recv(msg):
    return fluid.layers.sequence_pool(msg, "sum")
```
DGL does provide kernel fusion optimization with scatter-gather for general aggregate functions such as sum and max. But for **complex user-defined functions** handled with the degree bucketing algorithm, the serial execution over each degree bucket cannot take full advantage of the performance offered by GPUs. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL reaches up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we also provide built-in scatter-optimized message aggregation functions.
### Performance
We test all the following GNN algorithms with a Tesla V100-SXM2-16G, running 200 epochs to obtain average speeds, and we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
......@@ -50,3 +51,95 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message passing in DGL is forced to degenerate into the degree bucketing scheme, and it becomes much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL reaches up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
Graphs conveniently represent relations between real-world entities, but the categories of entities and the relations between them vary widely. Therefore, in a heterogeneous graph we need to distinguish node types and edge types. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support meta path walk sampling on heterogeneous graph
<div/>
<div align=center><img src="_static/metapath_sampling.png" width="750"></div>
<center>The metapath sampling in heterogeneous graph</center>
<div/>
The left side of the figure above depicts a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side shows a simple MetaPath sampling process: given the MetaPath UPU (user-product-user), sampling yields results like the following
<div/>
<div align=center><img src="_static/metapath_result.png" width="300"></div>
<center>The metapath result</center>
<div/>
On this basis, word2vec and related methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
### Support Message Passing mechanism on heterogeneous graph
<div/>
<div align=center><img src="_static/him_message_passing.png" width="750"></div>
<center>The message passing mechanism on heterogeneous graph</center>
<div/>
Because a heterogeneous graph has different node types, message passing differs as well. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, neighbors of different types are aggregated separately during message passing and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
## Large-Scale: Support distributed graph storage and distributed training algorithms
In most cases of large-scale graph learning, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so a large-scale distributed training algorithm can easily be set up on MPI clusters.
<div/>
<div align=center><img src="_static/distributed_frame.png" width="750"></div>
<center>The distributed frame of PGL</center>
<div/>
## Model Zoo
The following are 13 graph learning models that have been implemented in the framework.
|Model | feature |
|---|---|
| [GCN](examples/gcn_examples.html)| Graph Convolutional Neural Networks |
| [GAT](examples/gat_examples.html)| Graph Attention Network |
| [GraphSage](examples/graphsage_examples.html)|Large-scale graph convolution network based on neighborhood sampling|
| [unSup-GraphSage](examples/unsup_graphsage_examples.html) | Unsupervised GraphSAGE |
| [LINE](examples/line_examples.html)| Representation learning based on first-order and second-order neighbors |
| [DeepWalk](examples/distribute_deepwalk_examples.html)| Representation learning by DFS random walk |
| [MetaPath2Vec](examples/metapath2vec_examples.html)| Representation learning based on metapath |
| [Node2Vec](examples/node2vec_examples.html)| Representation learning combining DFS and BFS |
| [Struct2Vec](examples/strucvec_examples.html)| Representation learning based on structural similarity |
| [SGC](examples/sgc_examples.html)| Simplified graph convolution neural network |
| [GES](examples/ges_examples.html)| Graph representation learning with node features |
| [DGI](examples/dgi_examples.html)| Unsupervised representation learning based on graph convolution network |
| [GATNE](examples/gatne_examples.html)| Heterogeneous graph representation learning based on message passing |
The models above fall into three groups: graph representation learning, graph neural networks, and heterogeneous graph learning, where the heterogeneous graph models themselves again split into graph representation learning and graph neural networks.
## System requirements
PGL requires:
* paddle >= 1.6
* cython
PGL supports both Python 2 & 3
## Installation
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by NLP and Paddle Teams at Baidu
## License
PGL uses Apache License 2.0.
......@@ -8,8 +8,7 @@ To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
   paddlepaddle >= 1.6
   cython
We can simply install pgl by pip.
......
Quick Start with Heterogeneous Graph
====================================
Install PGL
-----------
To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
   paddlepaddle >= 1.6
   cython
We can simply install pgl by pip.
.. code-block:: sh
   pip install pgl
.. mdinclude:: md/quick_start_for_heterGraph.md
## Step 1: using PGL to create a graph
Suppose we have a graph with 10 nodes and 14 edges as shown in the following figure:
![A simple graph](images/quick_start_graph.png)
Our purpose is to train a graph neural network to classify yellow and green nodes. So we can create this graph in the following way:
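The tutorial's exact edge list is collapsed in this diff, so the sketch below builds an illustrative 10-node, 14-edge graph of the same shape; the edge list and the feature width are assumptions, not the tutorial's actual data.

```python
import numpy as np
import pgl

num_nodes = 10
# illustrative edges only; the tutorial's actual 14-edge list is elided here
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 0), (1, 5), (2, 6), (3, 7), (4, 8)]
feature = np.random.randn(num_nodes, 16).astype("float32")

graph = pgl.graph.Graph(num_nodes=num_nodes,
                        edges=edges,
                        node_feat={"feature": feature})
```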
```python
......@@ -49,7 +49,7 @@ Currently our PGL is developed based on static computational mode of paddle (we
import paddle.fluid as fluid
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
# use GraphWrapper as a container for graph data to construct a graph neural network
gw = pgl.graph_wrapper.GraphWrapper(name='graph',
......@@ -95,7 +95,7 @@ After defining the GCN layer, we can construct a deeper GCN model with two GCN l
```python
output = gcn_layer(gw, gw.node_feat['feature'],
                   hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=1,
                   name='gcn_layer_2', activation=None)
```
......
## Introduction
In the real world, many graphs contain multiple types of nodes and edges; we call them heterogeneous graphs. Obviously, heterogeneous graphs are more complex than homogeneous ones.
To deal with such heterogeneous graphs, PGL develops a graph framework that supports graph neural network computation and meta-path-based sampling on heterogeneous graphs.
The goals of this tutorial:
* show an example of heterogeneous graph data;
* understand how PGL supports computation on heterogeneous graphs;
* use PGL to implement a simple heterogeneous graph neural network model that classifies a particular type of node in a heterogeneous graph.
## Example of heterogenous graph
There is a lot of graph data consisting of nodes and edges of multiple types. For example, an **e-commerce network** is a very common heterogeneous graph in the real world. It contains at least two types of nodes (user and item) and two types of edges (buy and click).
The following figure depicts several users clicking or buying some items. This graph has two types of nodes, corresponding to "user" and "item", and two types of edges, "buy" and "click".
![A simple heterogenous e-commerce graph](images/heter_graph_introduction.png)
## Creating a heterogenous graph with PGL
In a heterogeneous graph there are multiple types of edges, so we should distinguish them. In PGL, edges are built in the format below:
```python
edges = {
    'click': [(0, 4), (0, 7), (1, 6), (2, 5), (3, 6)],
    'buy': [(0, 5), (1, 4), (1, 6), (2, 7), (3, 5)],
}
```
In a heterogeneous graph, nodes are also of different types. Therefore, you need to mark the type of each node; the format of the node types is as follows:
```python
node_types = [(0, 'user'), (1, 'user'), (2, 'user'), (3, 'user'), (4, 'item'),
              (5, 'item'), (6, 'item'), (7, 'item')]
```
Because the edges are of different types, edge features also need to be separated by type.
```python
import numpy as np

num_nodes = len(node_types)

node_features = {'features': np.random.randn(num_nodes, 8).astype("float32")}

edge_num_list = []
for edge_type in edges:
    edge_num_list.append(len(edges[edge_type]))

# cast edge features to float32 as well, since Paddle expects float32 inputs
edge_features = {
    'click': {'h': np.random.randn(edge_num_list[0], 4).astype("float32")},
    'buy': {'h': np.random.randn(edge_num_list[1], 4).astype("float32")},
}
```
Now, we can build a heterogenous graph by using PGL.
```python
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.contrib import heter_graph
from pgl.contrib import heter_graph_wrapper
g = heter_graph.HeterGraph(num_nodes=num_nodes,
                           edges=edges,
                           node_types=node_types,
                           node_feat=node_features,
                           edge_feat=edge_features)
```
In PGL, we need a graph wrapper as a container for graph data, so here we create a heterogeneous graph wrapper that holds one graph wrapper per edge type.
```python
place = fluid.CPUPlace()
# create a GraphWrapper as a container for graph data
gw = heter_graph_wrapper.HeterGraphWrapper(name='heter_graph',
                                           place=place,
                                           edge_types=g.edge_types_info(),
                                           node_feat=g.node_feat_info(),
                                           edge_feat=g.edge_feat_info())
```
## MessagePassing
After building the heterogeneous graph, we can easily carry out the message passing paradigm. In this case, we have two different types of edges, so we can write the function as follows:
```python
def message_passing(gw, edge_types, features, name=''):
    def __message(src_feat, dst_feat, edge_feat):
        # send the source node feature as the message
        return src_feat['h']

    def __reduce(feat):
        # sum the messages from neighbors
        return fluid.layers.sequence_pool(feat, pool_type='sum')

    assert len(edge_types) == len(features)
    output = []
    for i in range(len(edge_types)):
        msg = gw[edge_types[i]].send(__message, nfeat_list=[('h', features[i])])
        out = gw[edge_types[i]].recv(msg, __reduce)
        output.append(out)
    # list of per-edge-type aggregation results
    return output
```
```python
edge_types = ['click', 'buy']
features = []
for edge_type in edge_types:
    features.append(gw[edge_type].node_feat['features'])

output = message_passing(gw, edge_types, features)
output = fl.concat(input=output, axis=1)
output = fluid.layers.fc(output, size=4, bias_attr=False, act='relu', name='fc1')
logits = fluid.layers.fc(output, size=1, bias_attr=False, act=None, name='fc2')
```
## Data preprocessing
In this case, we implement a simple node classifier and use 0 and 1 to represent the two classes.
```python
y = [0,1,0,1,0,1,1,0]
label = np.array(y, dtype="float32").reshape(-1,1)
```
## Setting up the training program
The training process of the heterogeneous graph node classification model is the same as training other paddlepaddle-based models:
* First, build the loss function;
* Second, create an optimizer;
* Finally, create an executor and execute the training program.
```python
node_label = fluid.layers.data("node_label", shape=[None, 1], dtype="float32", append_batch_size=False)
loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=logits, label=node_label)
loss = fluid.layers.mean(loss)
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feed_dict = gw.to_feed(g)
for epoch in range(30):
    feed_dict['node_label'] = label
    train_loss = exe.run(fluid.default_main_program(), feed=feed_dict, fetch_list=[loss], return_numpy=True)
    print('Epoch %d | Loss: %f' % (epoch, train_loss[0]))
```
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file loads and preprocesses the dataset for GATNE model.
"""
import sys
import os
import tqdm
import numpy as np
import logging
import random
from pgl import heter_graph
import pickle as pkl
class Dataset(object):
    """Implementation of Dataset class

    This is a simple implementation of loading and processing dataset for GATNE model.

    Args:
        config: dict, some configure parameters.
    """

    def __init__(self, config):
        self.train_edges_file = config['data_path'] + 'train.txt'
        self.valid_edges_file = config['data_path'] + 'valid.txt'
        self.test_edges_file = config['data_path'] + 'test.txt'
        self.nodes_file = config['data_path'] + 'nodes.txt'
        self.config = config

        self.word2index = self.load_word2index()

        self.build_graph()
        self.valid_data = self.load_test_data(self.valid_edges_file)
        self.test_data = self.load_test_data(self.test_edges_file)
    def build_graph(self):
        """Build pgl heterogeneous graph.
        """
        edge_data_by_type, all_edges, all_nodes = self.load_training_data(
            self.train_edges_file,
            slf_loop=self.config['slf_loop'],
            symmetry_edge=self.config['symmetry_edge'])

        num_nodes = len(all_nodes)
        node_features = {
            'index': np.array(
                [i for i in range(num_nodes)], dtype=np.int64).reshape(-1, 1)
        }

        self.graph = heter_graph.HeterGraph(
            num_nodes=num_nodes,
            edges=edge_data_by_type,
            node_types=None,
            node_feat=node_features)

        self.edge_types = sorted(self.graph.edge_types_info())
        logging.info('total %d nodes are loaded' % (self.graph.num_nodes))
    def load_training_data(self, file_, slf_loop=True, symmetry_edge=True):
        """Load train data from file and preprocess them.

        Args:
            file_: str, file name for loading data
            slf_loop: bool, if true, add self loop edge for every node
            symmetry_edge: bool, if true, add symmetry edge for every edge
        """
        logging.info('loading data from %s' % file_)
        edge_data_by_type = dict()
        all_edges = list()
        all_nodes = list()

        with open(file_, 'r') as reader:
            for line in reader:
                words = line.strip().split(' ')
                if words[0] not in edge_data_by_type:
                    edge_data_by_type[words[0]] = []
                src, dst = words[1], words[2]
                edge_data_by_type[words[0]].append((src, dst))
                all_edges.append((src, dst))
                all_nodes.append(src)
                all_nodes.append(dst)
                if symmetry_edge:
                    edge_data_by_type[words[0]].append((dst, src))
                    all_edges.append((dst, src))

        all_nodes = list(set(all_nodes))
        all_edges = list(set(all_edges))
        # edge_data_by_type['Base'] = all_edges

        if slf_loop:
            for e_type in edge_data_by_type.keys():
                for n in all_nodes:
                    edge_data_by_type[e_type].append((n, n))

        # remapping to index
        edges_by_type = {}
        for edge_type, edges in edge_data_by_type.items():
            res_edges = []
            for edge in edges:
                res_edges.append(
                    (self.word2index[edge[0]], self.word2index[edge[1]]))
            edges_by_type[edge_type] = res_edges

        return edges_by_type, all_edges, all_nodes
    def load_test_data(self, file_):
        """Load testing data from file.
        """
        logging.info('loading data from %s' % file_)

        true_edge_data_by_type = {}
        fake_edge_data_by_type = {}
        with open(file_, 'r') as reader:
            for line in reader:
                words = line.strip().split(' ')
                src, dst = self.word2index[words[1]], self.word2index[words[2]]
                e_type = words[0]
                if int(words[3]) == 1:  # true edges
                    if e_type not in true_edge_data_by_type:
                        true_edge_data_by_type[e_type] = list()
                    true_edge_data_by_type[e_type].append((src, dst))
                else:  # fake edges
                    if e_type not in fake_edge_data_by_type:
                        fake_edge_data_by_type[e_type] = list()
                    fake_edge_data_by_type[e_type].append((src, dst))

        return (true_edge_data_by_type, fake_edge_data_by_type)
    def load_word2index(self):
        """Load words(nodes) from file and map to index.
        """
        word2index = {}
        with open(self.nodes_file, 'r') as reader:
            for index, line in enumerate(reader):
                node = line.strip()
                word2index[node] = index

        return word2index
    def generate_walks(self):
        """Generate random walks for every edge type.
        """
        all_walks = {}
        for e_type in self.edge_types:
            layer_walks = self.simulate_walks(
                edge_type=e_type,
                num_walks=self.config['num_walks'],
                walk_length=self.config['walk_length'])

            all_walks[e_type] = layer_walks

        return all_walks
    def simulate_walks(self, edge_type, num_walks, walk_length, schema=None):
        """Generate random walks in specified edge type.
        """
        walks = []
        nodes = list(range(0, self.graph[edge_type].num_nodes))

        for walk_iter in tqdm.tqdm(range(num_walks)):
            random.shuffle(nodes)
            for node in nodes:
                walk = self.graph[edge_type].random_walk(
                    [node], max_depth=walk_length - 1)
                for i in range(len(walk)):
                    walks.append(walk[i])

        return walks
    def generate_pairs(self, all_walks):
        """Generate word pairs for training.
        """
        logging.info(['edge_types before generate pairs', self.edge_types])
        pairs = []
        skip_window = self.config['win_size'] // 2
        for layer_id, e_type in enumerate(self.edge_types):
            walks = all_walks[e_type]
            for walk in tqdm.tqdm(walks):
                for i in range(len(walk)):
                    for j in range(1, skip_window + 1):
                        if i - j >= 0 and walk[i] != walk[i - j]:
                            neg_nodes = self.graph[e_type].sample_nodes(
                                self.config['neg_num'])
                            pairs.append(
                                (walk[i], walk[i - j], *neg_nodes, layer_id))
                        if i + j < len(walk) and walk[i] != walk[i + j]:
                            neg_nodes = self.graph[e_type].sample_nodes(
                                self.config['neg_num'])
                            pairs.append(
                                (walk[i], walk[i + j], *neg_nodes, layer_id))
        return pairs
def fetch_batch(self, pairs, batch_size, for_test=False):
"""Produce batch pairs data for training.
"""
np.random.shuffle(pairs)
n_batches = (len(pairs) + (batch_size - 1)) // batch_size
neg_num = len(pairs[0]) - 3
result = []
        for i in range(1, n_batches + 1):  # include the final, possibly partial batch
batch_pairs = np.array(
pairs[batch_size * (i - 1):batch_size * i], dtype=np.int64)
x = batch_pairs[:, 0].reshape(-1, ).astype(np.int64)
y = batch_pairs[:, 1].reshape(-1, 1, 1).astype(np.int64)
neg = batch_pairs[:, 2:2 + neg_num].reshape(-1, neg_num,
1).astype(np.int64)
t = batch_pairs[:, -1].reshape(-1, 1).astype(np.int64)
result.append((x, y, neg, t))
return result
if __name__ == "__main__":
config = {
'data_path': './data/youtube/',
'train_pairs_file': 'train_pairs.pkl',
'slf_loop': True,
'symmetry_edge': True,
'num_walks': 20,
'walk_length': 10,
'win_size': 5,
'neg_num': 5,
}
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level='INFO', format=log_format)
dataset = Dataset(config)
logging.info('generating walks')
all_walks = dataset.generate_walks()
    logging.info('finished generating walks')
    logging.info(['edge types of all walks: ', list(all_walks.keys())])
train_pairs = dataset.generate_pairs(all_walks)
pkl.dump(train_pairs,
open(config['data_path'] + config['train_pairs_file'], 'wb'))
    logging.info('finished generating train_pairs')
# GATNE: General Attributed Multiplex HeTerogeneous Network Embedding
[GATNE](https://arxiv.org/pdf/1905.01669.pdf) is an algorithmic framework for embedding large-scale Attributed Multiplex Heterogeneous Networks (AMHNs). Given a heterogeneous graph, which consists of nodes and edges of multiple types, it can learn continuous feature representations for every node. Based on PGL, we reproduce the GATNE algorithm.
## Datasets
The YouTube dataset contains 2,000 nodes, 1,310,617 edges and 5 edge types. We use it as the example dataset.
You can download the YouTube dataset from [here](https://github.com/THUDM/GATNE/tree/master/data).
After downloading the data, put it in ./data/ under the root directory of the GATNE model. The ./data/youtube/ directory should then contain three files:
* train.txt
* valid.txt
* test.txt
Then run the following command to preprocess the data.
```sh
python data_process.py --input_file ./data/youtube/train.txt --output_file ./data/youtube/nodes.txt
```
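Each line of train.txt encodes one edge as `edge_type src dst`, separated by spaces; this is the format consumed by `Dataset.load_training_data`, while valid.txt and test.txt carry an extra trailing 0/1 label (1 for true edges) read by `Dataset.load_test_data`. The ids below are purely illustrative:
```
1 2985 5216
2 2985 7548
```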
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All the hyperparameters are saved in the config.yaml file, so before training the GATNE model you can open config.yaml and modify the hyperparameters as you like.
For example, you can set "use_cuda" to "True" to train on GPU, or modify "data_path" to use a different dataset.
Some important hyperparameters in config.yaml:
- use_cuda: use GPU to train model
- data_path: the directory of dataset
- lr: learning rate
- neg_num: number of negative samples.
- num_walks: number of walks started from each node
- walk_length: walk length
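For instance, to train on GPU with the default YouTube data, the relevant keys of config.yaml look like this (a minimal sketch, mirroring the structure of the full config file):
```yaml
use_cuda: True
data_loader:
  args:
    data_path: ./data/youtube/
```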
## How to run
Then run the following command:
```sh
python main.py -c config.yaml
```
### Experiment results
| | PGL result | Reported result |
|:---:|------------|-----------------|
| AUC | 84.83 | 84.61 |
| PR | 82.77 | 81.93 |
| F1 | 76.98 | 76.83 |
task_name: train.gatne
use_cuda: True
log_level: info
seed: 1667
optimizer:
type:
args:
lr: 0.005
trainer:
type: trainer
args:
epochs: 2
log_dir: logs/
save_dir: checkpoints/
output_dir: outputs/
data_loader:
type: Dataset
args:
data_path: ./data/youtube/
train_pairs_file: train_pairs.pkl
batch_size: 256
num_walks: 20
walk_length: 10
win_size: 5
neg_num: 5
slf_loop: True
symmetry_edge: True
model:
type: GATNE
args:
dimensions: 200
edge_dim: 32
att_dim: 32
att_head: 1
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file preprocesses the data before training.
"""
import sys
import argparse
def gen_nodes_file(file_, result_file):
"""calculate the total number of nodes and save them for latter processing.
"""
nodes = []
with open(file_, 'r') as reader:
for line in reader:
tokens = line.strip().split(' ')
nodes.append(tokens[1])
nodes.append(tokens[2])
nodes = list(set(nodes))
nodes.sort(key=int)
print('total number of nodes: %d' % len(nodes))
print('saving nodes file in %s' % (result_file))
with open(result_file, 'w') as writer:
for n in nodes:
writer.write(n + '\n')
print('finished')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='GATNE')
parser.add_argument(
'--input_file',
default='./data/youtube/train.txt',
type=str,
help='input file')
parser.add_argument(
'--output_file',
default='./data/youtube/nodes.txt',
type=str,
help='output file')
args = parser.parse_args()
print('generating nodes file')
gen_nodes_file(args.input_file, args.output_file)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the GATNE model.
"""
import os
import argparse
import time
import random
import numpy as np
import logging
import pickle as pkl
import pgl
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from utils import *
import Dataset
import model as Model
from sklearn.metrics import (auc, f1_score, precision_recall_curve,
roc_auc_score)
def set_seed(seed):
"""Set random seed.
"""
random.seed(seed)
np.random.seed(seed)
def produce_model(exe, program, dataset, model, feed_dict):
"""Output the learned model parameters for testing.
"""
edge_types = dataset.edge_types
num_nodes = dataset.graph[edge_types[0]].num_nodes
edge_types_count = len(edge_types)
neg_num = dataset.config['neg_num']
final_model = {}
feed_dict['train_inputs'] = np.array(
[n for n in range(num_nodes)], dtype=np.int64).reshape(-1, )
feed_dict['train_labels'] = np.array(
[n for n in range(num_nodes)], dtype=np.int64).reshape(-1, 1, 1)
feed_dict['train_negs'] = np.tile(feed_dict['train_labels'],
(1, neg_num)).reshape(-1, neg_num, 1)
for i in range(edge_types_count):
feed_dict['train_types'] = np.array(
[i for _ in range(num_nodes)], dtype=np.int64).reshape(-1, 1)
edge_node_embed = exe.run(program,
feed=feed_dict,
fetch_list=[model.last_node_embed],
return_numpy=True)[0]
final_model[edge_types[i]] = edge_node_embed
return final_model
def evaluate(final_model, edge_types, data):
"""Calculate the AUC score, F1 score and PR score of the final model
"""
edge_types_count = len(edge_types)
AUC, F1, PR = [], [], []
true_edge_data_by_type = data[0]
fake_edge_data_by_type = data[1]
for i in range(edge_types_count):
try:
local_model = final_model[edge_types[i]]
true_edges = true_edge_data_by_type[edge_types[i]]
fake_edges = fake_edge_data_by_type[edge_types[i]]
except Exception as e:
            logging.warning('edge type does not exist. %s' % str(e))
continue
tmp_auc, tmp_f1, tmp_pr = calculate_score(local_model, true_edges,
fake_edges)
AUC.append(tmp_auc)
F1.append(tmp_f1)
PR.append(tmp_pr)
return {'AUC': np.mean(AUC), 'F1': np.mean(F1), 'PR': np.mean(PR)}
def calculate_score(model, true_edges, fake_edges):
"""Calculate the AUC score, F1 score and PR score of specified edge type
"""
true_list = list()
prediction_list = list()
true_num = 0
for edge in true_edges:
tmp_score = get_score(model, edge)
if tmp_score is not None:
true_list.append(1)
prediction_list.append(tmp_score)
true_num += 1
for edge in fake_edges:
tmp_score = get_score(model, edge)
if tmp_score is not None:
true_list.append(0)
prediction_list.append(tmp_score)
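    # Rank all scores and mark the top `true_num` predictions as positive, so
    # the predicted positive set matches the size of the true edge set.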
sorted_pred = prediction_list[:]
sorted_pred.sort()
threshold = sorted_pred[-true_num]
y_pred = np.zeros(len(prediction_list), dtype=np.int32)
for i in range(len(prediction_list)):
if prediction_list[i] >= threshold:
y_pred[i] = 1
y_true = np.array(true_list)
y_scores = np.array(prediction_list)
ps, rs, _ = precision_recall_curve(y_true, y_scores)
return roc_auc_score(y_true, y_scores), f1_score(y_true, y_pred), auc(rs,
ps)
def get_score(local_model, edge):
"""Calculate the cosine similarity score between two nodes.
"""
try:
vector1 = local_model[edge[0]]
vector2 = local_model[edge[1]]
return np.dot(vector1, vector2) / (np.linalg.norm(vector1) *
np.linalg.norm(vector2))
    except Exception as e:
        logging.warning('get_score warning: %s' % str(e))
        return None
def run_epoch(epoch,
config,
dataset,
data,
train_prog,
test_prog,
model,
feed_dict,
exe,
for_test=False):
"""Run training process of every epoch.
"""
    total_loss = []
    avg_loss = 0.0  # avoid an unbound avg_loss when there are fewer than 500 steps
for idx, batch_data in enumerate(data):
feed_dict['train_inputs'] = batch_data[0]
feed_dict['train_labels'] = batch_data[1]
feed_dict['train_negs'] = batch_data[2]
feed_dict['train_types'] = batch_data[3]
loss, lr = exe.run(train_prog,
feed=feed_dict,
fetch_list=[model.loss, model.lr],
return_numpy=True)
total_loss.append(loss[0])
if (idx + 1) % 500 == 0:
avg_loss = np.mean(total_loss)
logging.info("epoch %d | step %d | lr %.4f | train_loss %f " %
(epoch, idx + 1, lr, avg_loss))
total_loss = []
return avg_loss
def save_model(program, exe, dataset, model, feed_dict, filename):
"""Save model.
"""
final_model = produce_model(exe, program, dataset, model, feed_dict)
logging.info('saving model in %s' % (filename))
pkl.dump(final_model, open(filename, 'wb'))
def test(program, exe, dataset, model, feed_dict):
"""Testing and validating.
"""
final_model = produce_model(exe, program, dataset, model, feed_dict)
valid_result = evaluate(final_model, dataset.edge_types,
dataset.valid_data)
test_result = evaluate(final_model, dataset.edge_types, dataset.test_data)
logging.info("valid_AUC %.4f | valid_PR %.4f | valid_F1 %.4f" %
(valid_result['AUC'], valid_result['PR'], valid_result['F1']))
logging.info("test_AUC %.4f | test_PR %.4f | test_F1 %.4f" %
(test_result['AUC'], test_result['PR'], test_result['F1']))
return test_result
def main(config):
"""main function for training GATNE model.
"""
logging.info(config)
set_seed(config['seed'])
dataset = getattr(
Dataset, config['data_loader']['type'])(config['data_loader']['args'])
edge_types = dataset.graph.edge_types_info()
logging.info(['total edge types: ', edge_types])
    # train_pairs is a list of tuples: [(src1, dst1, *negs, e1), (src2, dst2, *negs, e2), ...]
    # e (int) is the edge-type index used to select the edge embedding
train_pairs_file = config['data_loader']['args']['data_path'] + \
config['data_loader']['args']['train_pairs_file']
if os.path.exists(train_pairs_file):
logging.info('loading train pairs from pkl file %s' % train_pairs_file)
train_pairs = pkl.load(open(train_pairs_file, 'rb'))
else:
logging.info('generating walks')
all_walks = dataset.generate_walks()
logging.info('generating train pairs')
train_pairs = dataset.generate_pairs(all_walks)
logging.info('dumping train pairs to %s' % (train_pairs_file))
pkl.dump(train_pairs, open(train_pairs_file, 'wb'))
logging.info('total train pairs: %d' % (len(train_pairs)))
data = dataset.fetch_batch(train_pairs,
config['data_loader']['args']['batch_size'])
place = fluid.CUDAPlace(0) if config['use_cuda'] else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
model = getattr(Model, config['model']['type'])(
config['model']['args'], dataset, place)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
global_steps = len(data) * config['trainer']['args']['epochs']
model.backward(global_steps, config['optimizer']['args'])
# train
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = model.gw.to_feed(dataset.graph)
logging.info('test before training...')
test(test_program, exe, dataset, model, feed_dict)
logging.info('training...')
for epoch in range(1, 1 + config['trainer']['args']['epochs']):
train_result = run_epoch(epoch, config['trainer']['args'], dataset,
data, train_program, test_program, model,
feed_dict, exe)
logging.info('validating and testing...')
test_result = test(test_program, exe, dataset, model, feed_dict)
filename = os.path.join(config['trainer']['args']['save_dir'],
'dict_embed_model_epoch_%d.pkl' % (epoch))
save_model(test_program, exe, dataset, model, feed_dict, filename)
logging.info(
"final_test_AUC %.4f | final_test_PR %.4f | fianl_test_F1 %.4f" % (
test_result['AUC'], test_result['PR'], test_result['F1']))
logging.info('training finished')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='GATNE')
parser.add_argument(
'-c',
'--config',
default=None,
type=str,
help='config file path (default: None)')
parser.add_argument(
'-n',
'--taskname',
default=None,
type=str,
help='task name(default: None)')
args = parser.parse_args()
if args.config:
# load config file
config = Config(args.config, isCreate=True, isSave=True)
config = config()
else:
raise AssertionError(
"Configuration file need to be specified. Add '-c config.yaml', for example."
)
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(
level=getattr(logging, config['log_level'].upper()), format=log_format)
main(config)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the GATNE model.
"""
import numpy as np
import math
import logging
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from pgl import heter_graph_wrapper
class GATNE(object):
"""Implementation of GATNE model.
Args:
config: dict, some configure parameters.
dataset: instance of Dataset class
place: GPU or CPU place
"""
def __init__(self, config, dataset, place):
logging.info(['model is: ', self.__class__.__name__])
self.config = config
self.graph = dataset.graph
        self.place = place
self.edge_types = sorted(self.graph.edge_types_info())
logging.info('edge_types in model: %s' % str(self.edge_types))
neg_num = dataset.config['neg_num']
# hyper parameters
self.num_nodes = self.graph.num_nodes
self.embedding_size = self.config['dimensions']
self.embedding_u_size = self.config['edge_dim']
self.dim_a = self.config['att_dim']
self.att_head = self.config['att_head']
self.edge_type_count = len(self.edge_types)
self.u_num = self.edge_type_count
self.gw = heter_graph_wrapper.HeterGraphWrapper(
name="heter_graph",
place=place,
edge_types=self.graph.edge_types_info(),
node_feat=self.graph.node_feat_info(),
edge_feat=self.graph.edge_feat_info())
self.train_inputs = fl.data(
'train_inputs', shape=[None], dtype='int64')
self.train_labels = fl.data(
'train_labels', shape=[None, 1, 1], dtype='int64')
self.train_types = fl.data(
'train_types', shape=[None, 1], dtype='int64')
self.train_negs = fl.data(
'train_negs', shape=[None, neg_num, 1], dtype='int64')
self.forward()
def forward(self):
"""Build the GATNE net.
"""
param_attr_init = fluid.initializer.Uniform(
low=-1.0, high=1.0, seed=np.random.randint(100))
embed_param_attrs = fluid.ParamAttr(
name='Base_node_embed', initializer=param_attr_init)
# node_embeddings
base_node_embed = fl.embedding(
input=fl.reshape(
self.train_inputs, shape=[-1, 1]),
size=[self.num_nodes, self.embedding_size],
param_attr=embed_param_attrs)
node_features = []
for edge_type in self.edge_types:
param_attr_init = fluid.initializer.Uniform(
low=-1.0, high=1.0, seed=np.random.randint(100))
embed_param_attrs = fluid.ParamAttr(
name='%s_node_embed' % edge_type, initializer=param_attr_init)
features = fl.embedding(
input=self.gw[edge_type].node_feat['index'],
size=[self.num_nodes, self.embedding_u_size],
param_attr=embed_param_attrs)
node_features.append(features)
# mp_output: list of embedding(self.num_nodes, dim)
mp_output = self.message_passing(self.gw, self.edge_types,
node_features)
# U : (num_type[m], num_nodes, dim[s])
node_type_embed = fl.stack(mp_output, axis=0)
# U : (num_nodes, num_type[m], dim[s])
node_type_embed = fl.transpose(node_type_embed, perm=[1, 0, 2])
#gather node_type_embed from train_inputs
node_type_embed = fl.gather(node_type_embed, self.train_inputs)
# M_r
trans_weights = fl.create_parameter(
shape=[
self.edge_type_count, self.embedding_u_size,
self.embedding_size // self.att_head
],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w')
# W_r
trans_weights_s1 = fl.create_parameter(
shape=[self.edge_type_count, self.embedding_u_size, self.dim_a],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w_s1')
# w_r
trans_weights_s2 = fl.create_parameter(
shape=[self.edge_type_count, self.dim_a, self.att_head],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w_s2')
trans_w = fl.gather(trans_weights, self.train_types)
trans_w_s1 = fl.gather(trans_weights_s1, self.train_types)
trans_w_s2 = fl.gather(trans_weights_s2, self.train_types)
attention = self.attention(node_type_embed, trans_w_s1, trans_w_s2)
node_type_embed = fl.matmul(attention, node_type_embed)
node_embed = base_node_embed + fl.reshape(
fl.matmul(node_type_embed, trans_w), [-1, self.embedding_size])
self.last_node_embed = fl.l2_normalize(node_embed, axis=1)
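        # Skip-gram style negative-sampling loss: score positive and sampled
        # negative context nodes against the fused node embedding, using one
        # shared 'nce_weight' table for both.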
nce_weight_initializer = fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size))
nce_weight_attrs = fluid.ParamAttr(
name='nce_weight', initializer=nce_weight_initializer)
weight_pos = fl.embedding(
input=self.train_labels,
size=[self.num_nodes, self.embedding_size],
param_attr=nce_weight_attrs)
weight_neg = fl.embedding(
input=self.train_negs,
size=[self.num_nodes, self.embedding_size],
param_attr=nce_weight_attrs)
tmp_node_embed = fl.unsqueeze(self.last_node_embed, axes=[1])
pos_logits = fl.matmul(
tmp_node_embed, weight_pos, transpose_y=True) # [B, 1, 1]
neg_logits = fl.matmul(
tmp_node_embed, weight_neg, transpose_y=True) # [B, 1, neg_num]
pos_score = fl.squeeze(pos_logits, axes=[1])
pos_score = fl.clip(pos_score, min=-10, max=10)
pos_score = -1.0 * fl.logsigmoid(pos_score)
neg_score = fl.squeeze(neg_logits, axes=[1])
neg_score = fl.clip(neg_score, min=-10, max=10)
neg_score = -1.0 * fl.logsigmoid(-1.0 * neg_score)
neg_score = fl.reduce_sum(neg_score, dim=1, keep_dim=True)
self.loss = fl.reduce_mean(pos_score + neg_score)
def attention(self, node_type_embed, trans_w_s1, trans_w_s2):
"""Calculate attention weights.
"""
attention = fl.tanh(fl.matmul(node_type_embed, trans_w_s1))
attention = fl.matmul(attention, trans_w_s2)
attention = fl.reshape(attention, [-1, self.u_num])
attention = fl.softmax(attention)
attention = fl.reshape(attention, [-1, self.att_head, self.u_num])
return attention
def message_passing(self, gw, edge_types, features, name=''):
"""Message passing from source nodes to dstination nodes
"""
def __message(src_feat, dst_feat, edge_feat):
"""send function
"""
return src_feat['h']
def __reduce(feat):
"""recv function
"""
return fluid.layers.sequence_pool(feat, pool_type='average')
if not isinstance(edge_types, list):
edge_types = [edge_types]
if not isinstance(features, list):
features = [features]
assert len(edge_types) == len(features)
output = []
for i in range(len(edge_types)):
msg = gw[edge_types[i]].send(
__message, nfeat_list=[('h', features[i])])
neigh_feat = gw[edge_types[i]].recv(msg, __reduce)
neigh_feat = fl.fc(neigh_feat,
size=neigh_feat.shape[-1],
name='neigh_fc_%d' % (i),
act='sigmoid')
slf_feat = fl.fc(features[i],
size=neigh_feat.shape[-1],
name='slf_fc_%d' % (i),
act='sigmoid')
out = fluid.layers.concat([slf_feat, neigh_feat], axis=1)
out = fl.fc(out, size=neigh_feat.shape[-1], name='fc', act=None)
out = fluid.layers.l2_normalize(out, axis=1)
output.append(out)
# list of matrix
return output
def backward(self, global_steps, opt_config):
"""Build the optimizer.
"""
self.lr = fl.polynomial_decay(opt_config['lr'], global_steps, 0.001)
adam = fluid.optimizer.Adam(learning_rate=self.lr)
adam.minimize(self.loss)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements a class for model configuration.
"""
import datetime
import os
import yaml
import random
import shutil
class Config(object):
"""Implementation of Config class for model configure.
Args:
config_file(str): configure filename, which is a yaml file.
        isCreate(bool): if true, create the necessary directories for saving models, logs and other outputs.
        isSave(bool): if true, save the config file to record the configuration.
"""
def __init__(self, config_file, isCreate=False, isSave=False):
self.config_file = config_file
self.config = self.get_config_from_yaml(config_file)
if isCreate:
self.create_necessary_dirs()
if isSave:
self.save_config_file()
def get_config_from_yaml(self, yaml_file):
"""Get the configure hyperparameters from yaml file.
"""
try:
with open(yaml_file, 'r') as f:
                config = yaml.safe_load(f)
except Exception:
raise IOError("Error in parsing config file '%s'" % yaml_file)
return config
def create_necessary_dirs(self):
"""Create some necessary directories to save some important files.
"""
time_stamp = datetime.datetime.now().strftime('%m%d_%H%M')
self.config['trainer']['args']['log_dir'] = ''.join(
(self.config['trainer']['args']['log_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['save_dir'] = ''.join(
(self.config['trainer']['args']['save_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['output_dir'] = ''.join(
(self.config['trainer']['args']['output_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
# if os.path.exists(self.config['trainer']['args']['save_dir']):
# input('save_dir is existed, do you really want to continue?')
self.make_dir(self.config['trainer']['args']['log_dir'])
self.make_dir(self.config['trainer']['args']['save_dir'])
self.make_dir(self.config['trainer']['args']['output_dir'])
def save_config_file(self):
"""Save config file so that we can know the config when we look back
"""
filename = self.config_file.split('/')[-1]
targetpath = self.config['trainer']['args']['save_dir']
shutil.copyfile(self.config_file, targetpath + filename)
def make_dir(self, path):
"""Build directory"""
if not os.path.exists(path):
os.makedirs(path)
def __getitem__(self, key):
"""Return the configure dict"""
return self.config[key]
def __call__(self):
"""__call__"""
return self.config
# DGI: Deep Graph Infomax
[Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
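As a rough illustration of this objective, here is a minimal NumPy sketch (the shapes, random initializers and plain bilinear discriminator are illustrative assumptions, not PGL's implementation):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
num_nodes, hidden = 4, 8
H = rng.randn(num_nodes, hidden)      # patch (node) embeddings from an encoder
H_neg = rng.randn(num_nodes, hidden)  # embeddings of a corrupted graph
W = rng.randn(hidden, hidden)         # bilinear discriminator weights

s = sigmoid(H.mean(axis=0))           # graph-level summary vector
pos_logits = H @ W @ s                # discriminator scores for real patches
neg_logits = H_neg @ W @ s            # scores for corrupted patches

# Binary cross-entropy: real patches -> 1, corrupted patches -> 0
loss = -(np.log(sigmoid(pos_logits)) + np.log(1.0 - sigmoid(neg_logits))).mean()
```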
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain an embedding for each node. Then we freeze the embeddings and train a node classifier on top.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For example, to pretrain with DGI and then train the classifier on the cora dataset using GPU:
```
python dgi.py --dataset cora --use_cuda
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if --use_cuda is specified.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
DGI Pretrain
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load dataset"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def save_param(dirname, var_name_list):
"""save_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
pos_gw = pgl.graph_wrapper.GraphWrapper(
name="pos_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
neg_gw = pgl.graph_wrapper.GraphWrapper(
name="neg_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
positive_feat = pgl.layers.gcn(pos_gw,
pos_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=pos_gw.node_feat['norm'],
name="gcn_layer_1")
negative_feat = pgl.layers.gcn(neg_gw,
neg_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=neg_gw.node_feat['norm'],
name="gcn_layer_1")
summary_feat = fluid.layers.sigmoid(
fluid.layers.reduce_mean(
positive_feat, [0], keep_dim=True))
summary_feat = fluid.layers.fc(summary_feat,
hidden_size,
bias_attr=False,
name="discriminator")
pos_logits = fluid.layers.matmul(
positive_feat, summary_feat, transpose_y=True)
neg_logits = fluid.layers.matmul(
negative_feat, summary_feat, transpose_y=True)
pos_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=pos_logits,
label=fluid.layers.ones(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
neg_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=neg_logits,
label=fluid.layers.zeros(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
loss = fluid.layers.reduce_mean(pos_loss) + fluid.layers.reduce_mean(
neg_loss)
adam = fluid.optimizer.Adam(learning_rate=1e-3)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
best_loss = 1e9
dur = []
for epoch in range(args.epoch):
feed_dict = pos_gw.to_feed(dataset.graph)
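        # Corrupt the graph by shuffling node feature rows; the corrupted
        # graph supplies the negative samples for the discriminator.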
node_feat = dataset.graph.node_feat["words"].copy()
perm = np.arange(0, dataset.graph.num_nodes)
np.random.shuffle(perm)
dataset.graph.node_feat["words"] = dataset.graph.node_feat["words"][
perm]
feed_dict.update(neg_gw.to_feed(dataset.graph))
dataset.graph.node_feat["words"] = node_feat
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
if train_loss[0] < best_loss:
best_loss = train_loss[0]
save_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss[0])
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='DGI pretrain')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument(
"--epoch", type=int, default=200, help="pretrain epochs")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Train
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output.stop_gradient = True
output = fluid.layers.fc(output,
dataset.num_classes,
act=None,
name="classifier")
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=1e-2)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
load_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Distributed Deepwalk in PGL
[Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the distributed deepwalk algorithm and match the metrics reported in the paper.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```pgl_deepwalk.cfg``` is the config file for deepwalk hyperparameters and ```local_config``` is the config file for the parameter servers. By default, we have 2 pservers and 2 trainers, and you can use ```cloud_run.sh``` to start the parameter servers and model trainers.
For example, to train deepwalk in distributed mode on the BlogCatalog dataset:
```sh
# train deepwalk in distributed mode.
sh cloud_run.sh
# multiclass task example
python3 multi_class.py --use_cuda --ckpt_path ./model_path/4029 --epoch 1000
```
## Hyperparameters
- dataset: The dataset name, "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epoch.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|distributed deepwalk|multi-label classification|MacroF1|0.233|0.211
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
source ./local_config
unset http_proxy https_proxy
# build train_data
trainer_num=`echo $PADDLE_PORT | awk -F',' '{print NF}'`
rm -rf train_data && mkdir -p train_data
cd train_data
if [[ $build_train_data == True ]];then
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/trainer_num/CPU_NUM+1))
else
for i in `seq 1 $trainer_num`;do
touch $i
done
fi
cd -
# mkdir workspace
if [ -d ${BASE} ]; then
rm -rf ${BASE}
fi
mkdir ${BASE}
# start ps
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $BASE
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $BASE/pserver.$i.log &
done
sleep 5s
# start trainers
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $BASE/worker.$j.log &
done
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
from multiprocessing import Process
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from pgl import data_loader
from reader import DeepwalkReader
from model import DeepwalkModel
from utils import get_file_list
from utils import build_graph
from utils import build_fake_graph
from utils import build_gen_func
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
        raise ValueError("unsupported optimizer: %s" % optimizer)
log.info('learning rate:%f' % (base_lr))
    # create the DistributeTranspiler config
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_complied_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(
loss_name=model_loss.name)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
if trainer_id == 0 or args.is_distributed:
model_save_dir = args.save_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def test(args):
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
start = time.time()
num = 10
for idx, _ in enumerate(gen_func()):
if idx % num == num - 1:
log.info("%s" % (1.0 * (time.time() - start) / num))
start = time.time()
def walk(args):
graph = build_graph(args.num_nodes, args.edge_path)
num_sample_workers = args.num_sample_workers
if args.train_files is None or args.train_files == "None":
log.info("Walking from graph...")
train_files = [None for _ in range(num_sample_workers)]
else:
log.info("Walking from train_data...")
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
def walk_to_file(walk_gen, filename, max_num):
with open(filename, "w") as outf:
num = 0
for walks in walk_gen:
for walk in walks:
outf.write("%s\n" % "\t".join([str(i) for i in walk]))
num += 1
if num % 1000 == 0:
log.info("Total: %s, %s walkpath is saved. " %
(max_num, num))
if num == max_num:
return
m_args = [(DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=None,
train_files=train_files[i]).walk_generator(),
"%s/%s" % (args.walkpath_files, i),
args.epoch * args.num_nodes // args.num_sample_workers)
for i in range(num_sample_workers)]
ps = []
for i in range(num_sample_workers):
p = Process(target=walk_to_file, args=m_args[i])
p.start()
ps.append(p)
for i in range(num_sample_workers):
ps[i].join()
def train(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
args.is_sparse, args.is_distributed, 1.)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(1. * args.num_nodes * args.epoch /
args.batch_size / num_devices / worker_num)
log.info("Train step: %s" % train_steps)
if args.optimizer == "sgd":
args.lr *= args.batch_size * args.walk_len * args.win_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
#just the worker, load the sample
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
elif args.dataset == "ArXiv":
graph = data_loader.ArXivDataset().graph
else:
                raise ValueError(args.dataset + " dataset doesn't exist")
log.info("Load buildin BlogCatalog dataset done.")
elif args.walkpath_files is None or args.walkpath_files == "None":
graph = build_graph(args.num_nodes, args.edge_path)
log.info("Load graph from '%s' done." % args.edge_path)
else:
graph = build_fake_graph(args.num_nodes)
log.info("Load fake graph done.")
# bind gen
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
compiled_prog = build_complied_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument(
"--hidden_size",
type=int,
default=64,
help="Hidden size of the embedding.")
parser.add_argument(
"--lr", type=float, default=0.025, help="Learning rate.")
parser.add_argument(
"--neg_num", type=int, default=5, help="Number of negative samples.")
parser.add_argument(
"--epoch", type=int, default=1, help="Number of training epoch.")
parser.add_argument(
"--batch_size",
type=int,
default=128,
help="Numbert of walk paths in a batch.")
parser.add_argument(
"--walk_len", type=int, default=40, help="Length of a walk path.")
parser.add_argument(
"--win_size", type=int, default=5, help="Window size in skip-gram.")
parser.add_argument(
"--save_path",
type=str,
default="model_path",
help="Output path for saving model.")
parser.add_argument(
"--num_sample_workers",
type=int,
default=1,
help="Number of sampling workers.")
parser.add_argument(
"--steps_per_save",
type=int,
default=3000,
help="Steps for model saveing.")
parser.add_argument(
"--num_nodes",
type=int,
default=10000,
help="Number of nodes in graph.")
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--train_files", type=str, default=None)
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--is_distributed", type=str2bool, default=False)
parser.add_argument("--is_sparse", type=str2bool, default=True)
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--mode",
type=str,
required=False,
choices=['train', 'walk'],
default="train")
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="sgd")
args = parser.parse_args()
log.info(args)
if args.mode == "train":
train(args)
elif args.mode == "walk":
walk(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from model import DeepwalkModel
from utils import build_graph
from utils import build_gen_func
def optimization(base_lr, loss, train_steps, optimizer='adam'):
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
        # don't use lazy mode on GPU
optimizer = F.optimizer.Adam(decayed_lr)
else:
        raise ValueError("unsupported optimizer: %s" % optimizer)
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def get_parallel_exe(program, loss):
exec_strategy = F.ExecutionStrategy()
    exec_strategy.num_threads = 1  # 2 for fp32, 4 for fp16
    exec_strategy.use_experimental_executor = True
    exec_strategy.num_iteration_per_drop_scope = 1  # drop local scopes every iteration
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step == train_steps or
step % args.steps_per_save == 0) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def main(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
num_devices = len(F.cuda_places())
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
False, False, 1.)
pyreader = model.pyreader
loss = model.forward()
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
if args.warm_start_from_dir is not None:
F.io.load_params(exe, args.warm_start_from_dir, train_prog)
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=1
if [[ $build_train_data == True ]];then
train_files="./train_data"
else
train_files="None"
fi
if [[ $pre_walk == True ]]; then
walkpath_files="./walk_path"
if [[ $TRAINING_ROLE == "PSERVER" ]];then
while [[ ! -d train_data ]];do
sleep 60
echo "Waiting for train_data ..."
done
rm -rf $walkpath_files && mkdir -p $walkpath_files
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --mode walk --walkpath_files $walkpath_files --epoch $epoch \
--walk_len $walk_len --batch_size $batch_size --train_files $train_files --dataset "BlogCatalog"
touch build_graph_done
fi
while [[ ! -f build_graph_done ]];do
sleep 60
echo "Waiting for walk_path ..."
done
else
walkpath_files="None"
fi
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --optimizer $optimizer --walkpath_files $walkpath_files --epoch $epoch \
--is_distributed $distributed_embedding --lr $learning_rate --neg_num $neg_num --walk_len $walk_len --win_size $win_size --is_sparse $is_sparse --hidden_size $dim \
--batch_size $batch_size --steps_per_save $steps_per_save --train_files $train_files --dataset "BlogCatalog"
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
export BASE="./local_dir"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Deepwalk model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class DeepwalkModel(object):
def __init__(self,
num_nodes,
hidden_size=16,
neg_num=5,
is_sparse=False,
is_distributed=False,
embedding_lr=1.0):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, neg_num + 1, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.neg_num = neg_num
self.embed_init = F.initializer.Uniform(
low=-1. / math.sqrt(hidden_size), high=1. / math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.is_distributed = is_distributed
self.hidden_size = hidden_size
self.loss = None
self.embedding_lr = embedding_lr
max_hidden_size = int(math.pow(2, 31) / 4 / num_nodes)
self.num_part = int(math.ceil(1. * hidden_size / max_hidden_size))
def forward(self):
src, dsts = L.read_file(self.pyreader)
if self.is_sparse:
            # sparse embedding lookup expects 2-D input
src = L.reshape(src, [-1, 1])
dsts = L.reshape(dsts, [-1, 1])
if self.num_part is not None and self.num_part != 1 and not self.is_distributed:
src_embed = split_embedding(
src,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
dsts_embed = split_embedding(
dsts,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
else:
src_embed = L.embedding(
src, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
dsts_embed = L.embedding(
dsts, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
if self.is_sparse:
# reshape back
src_embed = L.reshape(src_embed, [-1, 1, self.hidden_size])
dsts_embed = L.reshape(dsts_embed,
[-1, self.neg_num + 1, self.hidden_size])
logits = L.matmul(
src_embed, dsts_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
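# --- Hedged wiring sketch (ours): how DeepwalkModel is typically assembled.
# The sizes mirror the deepwalk config elsewhere in this repo (num_nodes=10312,
# dim=128, neg_num=5, lr=0.5); treat them as assumptions, not requirements.
def _deepwalk_program_demo():
    """Build the skip-gram program and attach an SGD optimizer to its loss."""
    train_prog, startup_prog = F.Program(), F.Program()
    with F.program_guard(train_prog, startup_prog):
        model = DeepwalkModel(num_nodes=10312, hidden_size=128, neg_num=5)
        loss = model.forward()
        F.optimizer.SGD(learning_rate=0.5).minimize(loss)
    return model.pyreader, loss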
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
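# --- Hedged round-trip check (ours): any object the registered pyarrow
# context can handle survives the serialize/deserialize pair used by the
# readers below.
def _serialization_demo():
    """Round-trip a feed-dict-like sample through pyarrow."""
    sample = {"ids": np.arange(4, dtype=np.int64), "label": 1}
    restored = deserialize_data(serialize_data(sample))
    assert restored["label"] == 1 and restored["ids"].shape == (4,)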
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader use python multi process to read data from readers
and then use multiprocess.Queue or multiprocess.Pipe to merge all
data. The process number is equal to the number of input readers, each
process call one reader.
Multiprocess.Queue require the rw access right to /dev/shm, some
platform does not support.
you need to create multiple readers first, these readers should be independent
to each other so that each process can work independently.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader1 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
sample = deserialize_data(conn.recv())
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
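# --- Hedged usage sketch (ours): merge two independent generators through
# pipes; each reader runs in its own process and samples arrive interleaved.
def _multiprocess_reader_demo():
    """Drain a pipe-based merged reader built from two toy readers."""
    def make_reader(start):
        def reader():
            for i in range(start, start + 3):
                yield [i]
        return reader

    merged = multiprocess_reader(
        [make_reader(0), make_reader(100)], use_pipe=True)
    return list(merged())  # six samples in total, from both processes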
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
else:
raise ValueError(name + " dataset doesn't exists")
return dataset
def node_classify_model(graph,
num_labels,
hidden_size=16,
name='node_classify_task'):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='weight'))
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
def batch_nodes_generator(shuffle=shuffle):
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
assert topk_list is not None or threshold is not None, "one of topk_list and threshold must not be None"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
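# --- Hedged numeric check (ours): with per-row top-k taken from the label
# counts (mirroring the reduce_sum over labels in the model above), both rows
# below are predicted perfectly, so macro-F1 is 1.0.
def _topk_f1_demo():
    """Sanity-check topk_f1_score on a tiny multilabel example."""
    labels = np.array([[1, 0, 1], [0, 1, 0]])
    probs = np.array([[0.9, 0.1, 0.8], [0.2, 0.7, 0.1]])
    topk = labels.sum(-1)  # [2, 1]
    return topk_f1_score(labels, probs, topk_list=topk, average="macro")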
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
threshold = args.threshold
dataset = load(args.dataset)
if args.batch_size is None:
batch_size = len(dataset.train_index)
else:
batch_size = args.batch_size
train_steps = (len(dataset.train_index) // batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels, train_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels, test_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=batch_size,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph, dataset.test_index, batch_size=batch_size, epoch=1))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
train_pyreader.start()
while True:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_loss, train_probs, train_labels, train_topk
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro", threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro", threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while True:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_loss, test_probs, test_labels, test_topk
],
return_numpy=True)
test_probs_vals.append(test_probs_val)
test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--threshold", type=float, default=0.3)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/baseline_node2vec/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# deepwalk config
num_nodes=10312 # max node_id + 1
num_sample_workers=2
epoch=100
optimizer=sgd # sgd or adam
learning_rate=0.5
neg_num=5
walk_len=40
win_size=5
dim=128
batch_size=8
steps_per_save=5000
is_sparse=False
distributed_embedding=False # only used when num_nodes > 100,000,000; slower than normal embedding
build_train_data=True
pre_walk=False
CPU_NUM=16
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
from pgl.utils import mp_reader
class DeepwalkReader(object):
def __init__(self,
graph,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
walkpath_files: if not None, read walk paths from walkpath_files
"""
self.graph = graph
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
#walk = [hash_map[x] for x in line.strip('\n\t').split('\t')]
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
def node_generator():
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
try:
src_list, pos_list = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src_list.append(s[:max_len])
pos_list.append(p[:max_len])
src = [s for x in src_list for s in x]
pos = [s for x in pos_list for s in x]
src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos,
[-1, 1, 1])
neg_sample_size = [len(pos), self.neg_num, 1]
if src.shape[0] == 0:
continue
if self.neg_sample_type == "average":
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
elif self.neg_sample_type == "inbatch":
pass
dst = np.concatenate([pos, negs], 1)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
yield src[:max_len], dst[:max_len]
except Exception as e:
log.exception(e)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import numpy as np
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from reader import DeepwalkReader
import mp_reader
def get_file_list(path):
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path):
filelist = get_file_list(edge_path)
edges, edge_weight = [], []
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
edges.append([slots[1], slots[0]])
if len(slots) > 2:
edge_weight.extend([float(slots[2]), float(slots[2])])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edge should be smaller than num_nodes!"
edge_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight)
graph = Graph(num_nodes, edges, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
del edges, edge_feat
log.info("Build graph index done")
if "weight" in graph.edge_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info("Build graph alias sample table done")
return graph
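# --- Hedged example (ours): build_graph expects whitespace-separated
# "src dst [weight]" lines and inserts every edge in both directions; a
# weight column triggers the alias-table build for weighted sampling.
def _build_graph_demo(tmp_dir="tmp_edges"):
    """Write a 2-edge weighted edge file and load it as a 3-node graph."""
    os.makedirs(tmp_dir, exist_ok=True)
    with open(os.path.join(tmp_dir, "part-0"), "w") as f:
        f.write("0 1 0.5\n1 2 2.0\n")
    return build_graph(num_nodes=3, edge_path=tmp_dir)  # 4 directed edges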
def build_fake_graph(num_nodes):
class FakeGraph():
pass
graph = FakeGraph()
graph.num_nodes = num_nodes
return graph
def build_gen_func(args, graph):
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None or args.walkpath_files == "None":
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None or args.train_files == "None":
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def test_gen_speed(gen_func):
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
# Distributed GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of metrics as the paper on the Reddit dataset. This is also an example of subgraph sampling and training in PGL.
### Datasets
The Reddit dataset should be downloaded from the links below and placed in the directory ```./data```. The details of the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
- reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
For high scalability, we use redis as the distributed graph storage solution and train GraphSAGE against a redis server.
### Datasets (Quickstart)
Alternatively, the Reddit dataset has been preprocessed and packed into a docker image, which can be pulled with the following command.
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
### Dependencies
```txt
paddlepaddle>=1.6
pgl
scipy
redis==2.10.6
redis-py-cluster==1.3.6
```
### How to run
To train a GraphSAGE model on the Reddit dataset, follow the two steps below.
#### 1. Start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
```
#### 2. Train the GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --sample_workers 10
```
#### Hyperparameters
- epoch: Number of epochs. (default: 10)
- use_cuda: Use GPU if --use_cuda is assigned.
- graphsage_type: We support 4 aggregator types, including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- normalize: Normalize the input features if --normalize is assigned.
- sample_workers: The number of workers for multiprocessing subgraph sampling.
- lr: Learning rate.
- symmetry: Make the edges symmetric (undirected) if --symmetry is assigned.
- batch_size: Batch size.
- samples_1: The max number of neighbors for the first-hop neighbor sampling. (default: 25)
- samples_2: The max number of neighbors for the second-hop neighbor sampling. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Aggregator | Accuracy | Reported in paper |
| --- | --- | --- |
| Mean | 95.70% | 95.0% |
| Meanpool | 95.60% | 94.8% |
| Maxpool | 94.95% | 94.8% |
| LSTM | 95.13% | 95.4% |
### View the Code
See the code [here](graphsage_examples_code.html).
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def mean_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
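# --- Hedged wiring sketch (ours), mirroring how train.py consumes these
# aggregators: one GraphWrapper feeds two stacked mean-aggregator hops. The
# feature name/width ('feats', 602) follows the Reddit setup used there.
def _graphsage_stack_demo():
    """Stack two graphsage_mean hops on top of a CPU GraphWrapper."""
    import numpy as np
    import pgl

    gw = pgl.graph_wrapper.GraphWrapper(
        "sub_graph", fluid.CPUPlace(),
        node_feat=[('feats', [None, 602], np.dtype('float32'))])
    h = gw.node_feat['feats']
    for i in range(2):
        h = graphsage_mean(gw, h, hidden_size=128, act="relu",
                           name="graphsage_mean_%s" % i)
    return gw, h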
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import socket
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
from pgl import redis_graph
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
redis_configs = [{
"host": socket.gethostbyname(socket.gethostname()),
"port": 7430
}, ]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
eid2edges = {}
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
for _dst, _srcs, _eids in zip(start_nodes, pred, pred_eid):
for _src, _eid in zip(_srcs, _eids):
eid2edges[_eid] = (_src, _dst)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(nodes=nodes, eid=eids, edges=[ eid2edges[e] for e in eids])
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
def multiprocess_graph_reader(
graph_wrapper,
samples,
node_index,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
for feed_dict in rd():
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)],
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
return reader()
scipy
redis==2.10.6
redis-py-cluster==1.3.6
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label is preprocess from reddit.npz without feats key.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data()
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", fluid.CPUPlace(), node_feat=[('feats', [None, 602], np.dtype('float32'))])
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=1,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Distributed metapath2vec, metapath2vec++, multi-metapath2vec++ in PGL
[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithmic framework for representation learning in heterogeneous networks, which contain multiple types of nodes and links. Given a heterogeneous graph, the metapath2vec algorithm first generates meta-path-based random walks and then uses a skip-gram model to train a language model. Based on PGL, we reproduce the metapath2vec algorithm in distributed mode.
### Datasets
DBLP: The dataset contains 14376 papers (P), 20 conferences (C), 14475 authors (A), and 8920 terms (T). There are 33791 nodes in this dataset.
You can download the datasets from [here](https://github.com/librahu/HIN-Datasets-for-Recommendation-and-Network-Embedding)
We use the ```DBLP``` dataset as an example. After downloading the dataset, put the files in, say, ```./data/DBLP/```.
### Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
### How to run
Before training, run the below command to do data preprocessing.
```sh
python data_process.py --data_path ./data/DBLP --output_path ./data/data_processed
```
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```config.yaml``` is a configuration file for metapath2vec hyperparameters and ```local_config``` is a configuration file for the PaddlePaddle parameter servers. By default, we have 2 pservers and 2 trainers. You can use ```cloud_run.sh``` to start up the parameter servers and model trainers.
For example, to train metapath2vec in distributed mode on the DBLP dataset:
```sh
# train metapath2vec in distributed mode.
sh cloud_run.sh
# multiclass task example
python multi_class.py --dataset ./data/data_processed/author_label.txt --ckpt_path ./checkpoints/2000 --num_nodes 33791
```
### Model Selection
There are 3 models in this example: ```metapath2vec```, ```metapath2vec++``` and ```multi_metapath2vec++```. You can select different models by modifying ```config.yaml```.
To run the ```metapath2vec++``` model, simply set the hyperparameter **neg_sample_type** to **m2v_plus**, and the ```metapath2vec++``` model will be selected.
```multi-metapath2vec++``` means that instead of using a single metapath, you can use several metapaths at the same time to train the model. For example, you might want to use ```c2p-p2a-a2p-p2c``` and ```p2a-a2p``` simultaneously. To do so, rewrite the hyperparameters below in ```config.yaml```.
- **neg_sample_type**: "m2v_plus"
- **walk_mode**: "multi_m2v"
- **meta_path**: "c2p-p2a-a2p-p2c;p2a-a2p"
- **first_node_type**: "c;p"
### Hyperparameters
All the hyperparameters are saved in the ```config.yaml``` file, so before training you can open config.yaml and modify the hyperparameters as you like.
Some important hyperparameters in ```config.yaml```:
- **edge_path**: the directory of graph data that you want to load
- **lr**: learning rate
- **neg_num**: number of negative samples.
- **num_walks**: number of walks started from each node
- **walk_len**: walk length
- **meta_path**: meta path scheme
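If you want to sanity-check these settings before launching a distributed job, the short sketch below (ours; it assumes you run it from this example's directory) loads them the same way ```cluster_train.py``` does, via ```utils.load_config```.
```python
# Hedged sketch: inspect config.yaml the way cluster_train.py does.
from utils import load_config

config = load_config("./config.yaml")
print(config.meta_path, config.walk_len, config.lr, config.neg_num)
```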
#!/bin/bash
set -x
mode=${1}
source ./utils.sh
unset http_proxy https_proxy
source ./local_config
if [ ! -d ${log_dir} ]; then
mkdir ${log_dir}
fi
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $log_dir
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $log_dir/pserver.$i.log &
done
sleep 10s
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $log_dir/worker.$j.log &
done
tail -f $log_dir/worker.0.log
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from model import Metapath2vecModel
from graph import m2vGraph
from utils import load_config
from walker import multiprocess_data_generator
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
# create the DistributeTranspiler config
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_compiled_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(loss_name=model_loss.name)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
save_path = args.save_path
if trainer_id == 0:
model_path = os.path.join(save_path, "%s" % step)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def main(args):
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = Metapath2vecModel(config=args)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(args.num_nodes * args.epochs / args.batch_size /
num_devices / worker_num)
log.info("Train step: %s" % train_steps)
real_batch_size = args.batch_size * args.walk_len * args.win_size
if args.optimizer == "sgd":
args.lr *= real_batch_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
#just the worker, load the sample
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
dataset = m2vGraph(args)
log.info("Build graph done.")
data_generator = multiprocess_data_generator(args, dataset)
cur_time = time.time()
for idx, _ in enumerate(data_generator()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
pyreader.decorate_tensor_provider(data_generator)
pyreader.start()
compiled_prog = build_compiled_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='metapath2vec')
parser.add_argument("-c", "--config", type=str, default="./config.yaml")
args = parser.parse_args()
config = load_config(args.config)
log.info(config)
main(config)
# graph data config
edge_path: "./data/data_processed"
edge_files: "p2a:paper_author.txt,p2c:paper_conference.txt,p2t:paper_type.txt"
node_types_file: "node_types.txt"
num_nodes: 37791
symmetry: True
# skipgram pair data config
win_size: 5
neg_num: 5
# average; m2v_plus
neg_sample_type: "average"
# random walk config
# m2v; multi_m2v;
walk_mode: "m2v"
meta_path: "c2p-p2a-a2p-p2c"
first_node_type: "c"
walk_len: 24
batch_size: 4
node_shuffle: True
node_files: null
num_sample_workers: 2
# model config
embed_dim: 64
is_sparse: True
# only used when num_nodes > 100,000,000; slower than normal embedding
is_distributed: False
# training config
epochs: 10
optimizer: "sgd"
lr: 0.1
warm_start_from_dir: null
walkpath_files: "None"
train_files: "None"
steps_per_save: 1000
save_path: "./checkpoints"
log_dir: "./logs"
CPU_NUM: 16
#!/bin/bash
set -x
source ./utils.sh
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=0
python -u cluster_train.py -c config.yaml
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
# parse_yaml reads a two-level yaml file and emits shell variable
# assignments (e.g. `lr: 0.1` becomes lr="0.1"), optionally prefixed.
function parse_yaml {
local prefix=$2
local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
sed -ne "s|^\($s\):|\1|" \
-e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
-e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
awk -F$fs '{
indent = length($1)/2;
vname[indent] = $2;
for (i in vname) {if (i > indent) {delete vname[i]}}
if (length($3) > 0) {
vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
}
}'
}
eval $(parse_yaml "$(dirname "${BASH_SOURCE}")"/config.yaml)