Unverified commit 6665dc5e authored by Huang Zhengjie, committed by GitHub

Merge pull request #7 from Yelrose/master

PGL version 1.0.0
......@@ -23,7 +23,7 @@ repos:
sha: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
hooks:
- id: check-added-large-files
args: [--maxkb=1024]
args: [--maxkb=4096]
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
......
sphinx==2.1.0
mistune
sphinx_rtd_theme
numpy >= 1.16.4
cython >= 0.25.2
paddlepaddle
pgl
......@@ -8,3 +8,4 @@ API Reference
pgl.layers
pgl.data_loader
pgl.utils.paddle_helper
pgl.utils.mp_reader
pgl.utils.mp\_reader module: MultiProcessing reader helper function for Paddle.
================================================================================
.. automodule:: pgl.utils.mp_reader
:members:
:undoc-members:
:show-inheritance:
......@@ -32,18 +32,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275s | 0.0253s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~83% |
| Pubmed | ~78% |
| Citeseer | ~70% |
### How to run
......
......@@ -27,18 +27,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~79% |
| Citeseer | ~71% |
### How to run
......
......@@ -12,7 +12,7 @@ The reddit dataset should be downloaded from the following links and placed in d
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### How to run
......@@ -22,6 +22,14 @@ To train a GraphSAGE model on Reddit Dataset, you can just run
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
If you want to train a GraphSAGE model with multiple GPUs, you can just run
```
CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2
```
#### Hyperparameters
- epoch: Number of epochs (default: 10)
......
......@@ -5,7 +5,7 @@
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- paddlepaddle>=1.6
- pgl
## How to run
......
......@@ -19,11 +19,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
| Dataset | Accuracy | epoch time | examples/gat | Improvement |
| --- | --- | --- | --- | --- |
| Cora | ~83% | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0124s |0.0253s | 2.04x |
### How to run
......
......@@ -10,7 +10,7 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
......@@ -18,11 +18,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| Dataset | Accuracy | epoch time | examples/gcn | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
| Cora | ~81% | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0045s |0.0177s | 3.93x |
......
......@@ -8,8 +8,7 @@ To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
paddlepaddle >= 1.4 (Faster performance on 1.5)
networkx
paddlepaddle >= 1.6
cython
We can simply install pgl by pip.
......
......@@ -35,8 +35,8 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
return fluid.layers.sequence_pool(msg, "sum")
```
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** its degree bucketing algorithm executes each degree bucket serially and cannot take full advantage of the performance improvement provided by the GPU. In contrast, operations on the PGL LodTensor-based message are performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** its degree bucketing algorithm executes each degree bucket serially and cannot take full advantage of the performance improvement provided by the GPU. In contrast, operations on the PGL LodTensor-based message are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
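For instance, a user-defined reducer is just a function over the received message LodTensor and can freely combine these built-in sequence ops. A minimal sketch (`mixed_reduce` and its `msg` argument are illustrative names, assuming the same recv interface as the snippet above):
```python
import paddle.fluid as fluid

def mixed_reduce(msg):
    # Combine sum- and max-pooled messages; each sequence_pool call runs in
    # parallel over all nodes, with no per-degree bucketing.
    return 0.5 * (fluid.layers.sequence_pool(msg, "sum") +
                  fluid.layers.sequence_pool(msg, "max"))
```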
## Performance
......@@ -50,11 +50,3 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message passing in DGL is forced to degenerate into the degree bucketing scheme and becomes much slower than the PGL implementation; a sketch of such an LSTM reducer follows the table below. Performance may vary with the scale of the graph; in our experiments, PGL can reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
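As a concrete illustration of such a complex reducer, here is a hedged sketch of an LSTM-style aggregation over the received message LodTensor (the function name, the `msg` argument and `hidden_size` are assumptions for illustration, not code from this repository):
```python
import paddle.fluid as fluid

def lstm_reduce(msg, hidden_size=128):
    # Project messages to 4 * hidden_size as dynamic_lstm expects, run the LSTM
    # over each node's variable-length message sequence, and keep the last
    # hidden state as the aggregated neighbor representation.
    proj = fluid.layers.fc(input=msg, size=hidden_size * 4, act=None)
    hidden, _ = fluid.layers.dynamic_lstm(input=proj, size=hidden_size * 4)
    return fluid.layers.sequence_pool(hidden, "last")
```
Because the whole LodTensor is processed in a single op, no per-degree serial loop is needed on the PGL side.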
......@@ -95,7 +95,7 @@ After defining the GCN layer, we can construct a deeper GCN model with two GCN l
```python
output = gcn_layer(gw, gw.node_feat['feature'],
hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=1,
output = gcn_layer(gw, output, hidden_size=2,
name='gcn_layer_2', activation=None)
```
......
# PGL Examples for DGI
[Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
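For reference, the unsupervised objective minimized in `dgi.py` below is the standard DGI binary cross-entropy (restated here from the paper, up to a constant scale; the corrupted graph is obtained by shuffling node features):

$$\mathcal{L} = -\frac{1}{2N}\sum_{i=1}^{N}\Big[\log \mathcal{D}\big(\vec h_i, \vec s\big) + \log\big(1 - \mathcal{D}(\tilde h_i, \vec s)\big)\Big]$$

where $\vec h_i$ are node (patch) representations of the original graph, $\tilde h_i$ those of the corrupted graph, $\vec s$ is the readout summary, and $\mathcal{D}$ is a bilinear discriminator.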
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain embeddings for each node. Then we fix the embeddings and train a node classifier.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For example, use GPU to train DGI on the cora dataset.
```
python dgi.py --dataset cora --use_cuda
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if use_cuda is specified.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
DGI Pretrain
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load dataset"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def save_param(dirname, var_name_list):
"""save_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
pos_gw = pgl.graph_wrapper.GraphWrapper(
name="pos_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
neg_gw = pgl.graph_wrapper.GraphWrapper(
name="neg_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
positive_feat = pgl.layers.gcn(pos_gw,
pos_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=pos_gw.node_feat['norm'],
name="gcn_layer_1")
negative_feat = pgl.layers.gcn(neg_gw,
neg_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=neg_gw.node_feat['norm'],
name="gcn_layer_1")
summary_feat = fluid.layers.sigmoid(
fluid.layers.reduce_mean(
positive_feat, [0], keep_dim=True))
summary_feat = fluid.layers.fc(summary_feat,
hidden_size,
bias_attr=False,
name="discriminator")
pos_logits = fluid.layers.matmul(
positive_feat, summary_feat, transpose_y=True)
neg_logits = fluid.layers.matmul(
negative_feat, summary_feat, transpose_y=True)
pos_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=pos_logits,
label=fluid.layers.ones(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
neg_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=neg_logits,
label=fluid.layers.zeros(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
loss = fluid.layers.reduce_mean(pos_loss) + fluid.layers.reduce_mean(
neg_loss)
adam = fluid.optimizer.Adam(learning_rate=1e-3)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
best_loss = 1e9
dur = []
for epoch in range(args.epoch):
feed_dict = pos_gw.to_feed(dataset.graph)
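        # Build the corrupted (negative) graph by shuffling node features,
        # feed it through neg_gw, then restore the original features.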
node_feat = dataset.graph.node_feat["words"].copy()
perm = np.arange(0, dataset.graph.num_nodes)
np.random.shuffle(perm)
dataset.graph.node_feat["words"] = dataset.graph.node_feat["words"][
perm]
feed_dict.update(neg_gw.to_feed(dataset.graph))
dataset.graph.node_feat["words"] = node_feat
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
if train_loss[0] < best_loss:
best_loss = train_loss[0]
save_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss[0])
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='DGI pretrain')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument(
"--epoch", type=int, default=200, help="pretrain epochs")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Train
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output.stop_gradient = True
output = fluid.layers.fc(output,
dataset.num_classes,
act=None,
name="classifier")
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=1e-2)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
load_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for distributed deepwalk
[Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the distributed DeepWalk algorithm and reach the same level of metrics as reported in the paper.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
For example, to train DeepWalk in distributed mode on the BlogCatalog dataset:
```sh
# train deepwalk in distributed mode.
sh cloud_run.sh
# multiclass task example
python3 multi_class.py --use_cuda --ckpt_path ./model_path/4029 --epoch 1000
```
## Hyperparameters
- dataset: The dataset "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epochs.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|distributed deepwalk|multi-label classification|MacroF1|0.233|0.211
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
source ./local_config
unset http_proxy https_proxy
# build train_data
trainer_num=`echo $PADDLE_PORT | awk -F',' '{print NF}'`
rm -rf train_data && mkdir -p train_data
cd train_data
if [[ $build_train_data == True ]];then
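    # shuffle all node ids and split them into roughly equal shards,
    # one per trainer process (trainer_num * CPU_NUM files in total)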
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/trainer_num/CPU_NUM+1))
else
for i in `seq 1 $trainer_num`;do
touch $i
done
fi
cd -
# mkdir workspace
if [ -d ${BASE} ]; then
rm -rf ${BASE}
fi
mkdir ${BASE}
# start ps
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $BASE
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $BASE/pserver.$i.log &
done
sleep 5s
# start trainers
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $BASE/worker.$j.log &
done
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
from multiprocessing import Process
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from pgl import data_loader
from reader import DeepwalkReader
from model import DeepwalkModel
from utils import get_file_list
from utils import build_graph
from utils import build_fake_graph
from utils import build_gen_func
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
#create the DistributeTranspiler configure
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_complied_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(
loss_name=model_loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
if trainer_id == 0 or args.is_distributed:
model_save_dir = args.save_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def test(args):
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
start = time.time()
num = 10
for idx, _ in enumerate(gen_func()):
if idx % num == num - 1:
log.info("%s" % (1.0 * (time.time() - start) / num))
start = time.time()
def walk(args):
graph = build_graph(args.num_nodes, args.edge_path)
num_sample_workers = args.num_sample_workers
if args.train_files is None or args.train_files == "None":
log.info("Walking from graph...")
train_files = [None for _ in range(num_sample_workers)]
else:
log.info("Walking from train_data...")
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
def walk_to_file(walk_gen, filename, max_num):
with open(filename, "w") as outf:
num = 0
for walks in walk_gen:
for walk in walks:
outf.write("%s\n" % "\t".join([str(i) for i in walk]))
num += 1
if num % 1000 == 0:
log.info("Total: %s, %s walkpath is saved. " %
(max_num, num))
if num == max_num:
return
m_args = [(DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=None,
train_files=train_files[i]).walk_generator(),
"%s/%s" % (args.walkpath_files, i),
args.epoch * args.num_nodes // args.num_sample_workers)
for i in range(num_sample_workers)]
ps = []
for i in range(num_sample_workers):
p = Process(target=walk_to_file, args=m_args[i])
p.start()
ps.append(p)
for i in range(num_sample_workers):
ps[i].join()
def train(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
args.is_sparse, args.is_distributed, 1.)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(1. * args.num_nodes * args.epoch /
args.batch_size / num_devices / worker_num)
log.info("Train step: %s" % train_steps)
if args.optimizer == "sgd":
args.lr *= args.batch_size * args.walk_len * args.win_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
        # only the worker loads the training samples
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
elif args.dataset == "ArXiv":
graph = data_loader.ArXivDataset().graph
else:
                raise ValueError(args.dataset + " dataset doesn't exist")
            log.info("Load built-in dataset %s done." % args.dataset)
elif args.walkpath_files is None or args.walkpath_files == "None":
graph = build_graph(args.num_nodes, args.edge_path)
log.info("Load graph from '%s' done." % args.edge_path)
else:
graph = build_fake_graph(args.num_nodes)
log.info("Load fake graph done.")
# bind gen
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
compiled_prog = build_complied_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument(
"--hidden_size",
type=int,
default=64,
help="Hidden size of the embedding.")
parser.add_argument(
"--lr", type=float, default=0.025, help="Learning rate.")
parser.add_argument(
"--neg_num", type=int, default=5, help="Number of negative samples.")
parser.add_argument(
"--epoch", type=int, default=1, help="Number of training epoch.")
parser.add_argument(
"--batch_size",
type=int,
default=128,
help="Numbert of walk paths in a batch.")
parser.add_argument(
"--walk_len", type=int, default=40, help="Length of a walk path.")
parser.add_argument(
"--win_size", type=int, default=5, help="Window size in skip-gram.")
parser.add_argument(
"--save_path",
type=str,
default="model_path",
help="Output path for saving model.")
parser.add_argument(
"--num_sample_workers",
type=int,
default=1,
help="Number of sampling workers.")
parser.add_argument(
"--steps_per_save",
type=int,
default=3000,
help="Steps for model saveing.")
parser.add_argument(
"--num_nodes",
type=int,
default=10000,
help="Number of nodes in graph.")
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--train_files", type=str, default=None)
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--is_distributed", type=str2bool, default=False)
parser.add_argument("--is_sparse", type=str2bool, default=True)
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--mode",
type=str,
required=False,
choices=['train', 'walk'],
default="train")
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="sgd")
args = parser.parse_args()
log.info(args)
if args.mode == "train":
train(args)
elif args.mode == "walk":
walk(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from model import DeepwalkModel
from utils import build_graph
from utils import build_gen_func
def optimization(base_lr, loss, train_steps, optimizer='adam'):
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
        # don't use lazy mode on GPU
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def get_parallel_exe(program, loss):
exec_strategy = F.ExecutionStrategy()
    exec_strategy.num_threads = 1  # 2 for fp32, 4 for fp16
exec_strategy.use_experimental_executor = True
    exec_strategy.num_iteration_per_drop_scope = 1  # important: drop the local execution scope every iteration
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step == train_steps or
step % args.steps_per_save == 0) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def main(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
num_devices = len(F.cuda_places())
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
False, False, 1.)
pyreader = model.pyreader
loss = model.forward()
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
if args.warm_start_from_dir is not None:
F.io.load_params(exe, args.warm_start_from_dir, train_prog)
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=1
if [[ $build_train_data == True ]];then
train_files="./train_data"
else
train_files="None"
fi
if [[ $pre_walk == True ]]; then
walkpath_files="./walk_path"
if [[ $TRAINING_ROLE == "PSERVER" ]];then
while [[ ! -d train_data ]];do
sleep 60
echo "Waiting for train_data ..."
done
rm -rf $walkpath_files && mkdir -p $walkpath_files
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --mode walk --walkpath_files $walkpath_files --epoch $epoch \
--walk_len $walk_len --batch_size $batch_size --train_files $train_files --dataset "BlogCatalog"
touch build_graph_done
fi
while [[ ! -f build_graph_done ]];do
sleep 60
echo "Waiting for walk_path ..."
done
else
walkpath_files="None"
fi
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --optimizer $optimizer --walkpath_files $walkpath_files --epoch $epoch \
--is_distributed $distributed_embedding --lr $learning_rate --neg_num $neg_num --walk_len $walk_len --win_size $win_size --is_sparse $is_sparse --hidden_size $dim \
--batch_size $batch_size --steps_per_save $steps_per_save --train_files $train_files --dataset "BlogCatalog"
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
export BASE="./local_dir"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Deepwalk model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class DeepwalkModel(object):
def __init__(self,
num_nodes,
hidden_size=16,
neg_num=5,
is_sparse=False,
is_distributed=False,
embedding_lr=1.0):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, neg_num + 1, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.neg_num = neg_num
self.embed_init = F.initializer.Uniform(
low=-1. / math.sqrt(hidden_size), high=1. / math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.is_distributed = is_distributed
self.hidden_size = hidden_size
self.loss = None
self.embedding_lr = embedding_lr
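        # Presumably caps each embedding slice so that a single float32
        # parameter tensor stays under 2^31 bytes (~2 GB).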
max_hidden_size = int(math.pow(2, 31) / 4 / num_nodes)
self.num_part = int(math.ceil(1. * hidden_size / max_hidden_size))
def forward(self):
src, dsts = L.read_file(self.pyreader)
if self.is_sparse:
            # sparse mode uses 2-dim input.
src = L.reshape(src, [-1, 1])
dsts = L.reshape(dsts, [-1, 1])
if self.num_part is not None and self.num_part != 1 and not self.is_distributed:
src_embed = split_embedding(
src,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
dsts_embed = split_embedding(
dsts,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
else:
src_embed = L.embedding(
src, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
dsts_embed = L.embedding(
dsts, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
if self.is_sparse:
# reshape back
src_embed = L.reshape(src_embed, [-1, 1, self.hidden_size])
dsts_embed = L.reshape(dsts_embed,
[-1, self.neg_num + 1, self.hidden_size])
logits = L.matmul(
src_embed, dsts_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
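        # Weight the single positive by neg_num so positives and negatives
        # contribute equally; the rescaling after reduce_mean below restores
        # the average weight to 1.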
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
    multiprocess_reader uses Python multiprocessing to read data from the
    given readers in parallel, and then merges all the data through a
    multiprocessing.Queue or multiprocessing.Pipe. The number of processes
    equals the number of input readers; each process calls one reader.
    multiprocessing.Queue requires read/write access to /dev/shm, which some
    platforms do not support.
    You need to create multiple readers first; these readers should be
    independent of each other so that each process can work independently.
    An example:
    .. code-block:: python
        reader0 = reader(["file01", "file02"])
        reader1 = reader(["file11", "file12"])
        reader2 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
                buff = conn.recv()
                sample = deserialize_data(buff)
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def node_classify_model(graph,
num_labels,
hidden_size=16,
name='node_classify_task'):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='weight'))
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
#labels = (np.random.rand(512, 39) > 0.95).astype(np.float32)
def batch_nodes_generator(shuffle=shuffle):
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
    assert topk_list is not None or threshold is not None, "one of topk_list and threshold should not be None"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
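        # For each sample, mark its top-k most probable labels as positive,
        # where k comes from topk_list (here, the sample's true label count).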
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
threshold = args.threshold
dataset = load(args.dataset)
if args.batch_size is None:
batch_size = len(dataset.train_index)
else:
batch_size = args.batch_size
train_steps = (len(dataset.train_index) // batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels, train_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels, test_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=batch_size,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph, dataset.test_index, batch_size=batch_size, epoch=1))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
train_pyreader.start()
while 1:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_loss, train_probs, train_labels, train_topk
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro", threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro", threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_loss, test_probs, test_labels, test_topk
],
return_numpy=True)
test_probs_vals.append(
test_probs_val), test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--threshold", type=float, default=0.3)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/baseline_node2vec/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# deepwalk config
num_nodes=10312 # max node_id + 1
num_sample_workers=2
epoch=100
optimizer=sgd # sgd or adam
learning_rate=0.5
neg_num=5
walk_len=40
win_size=5
dim=128
batch_size=8
steps_per_save=5000
is_sparse=False
distributed_embedding=False # only use when num_nodes > 100,000,000; slower than normal embedding
build_train_data=True
pre_walk=False
CPU_NUM=16
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
from pgl.utils import mp_reader
class DeepwalkReader(object):
def __init__(self,
graph,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
            walkpath_files: if not None, read walk paths from walkpath_files
"""
self.graph = graph
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
#walk = [hash_map[x] for x in line.strip('\n\t').split('\t')]
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
def node_generator():
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
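        # Cap on the number of (src, pos) pairs kept per walk/batch;
        # (1 + win_size) - 0.3 appears to be an empirical estimate of pairs per step.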
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
try:
src_list, pos_list = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src_list.append(s[:max_len]), pos_list.append(p[:max_len])
src = [s for x in src_list for s in x]
pos = [s for x in pos_list for s in x]
                src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos,
[-1, 1, 1])
neg_sample_size = [len(pos), self.neg_num, 1]
if src.shape[0] == 0:
continue
if self.neg_sample_type == "average":
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
elif self.neg_sample_type == "inbatch":
pass
dst = np.concatenate([pos, negs], 1)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
yield src[:max_len], dst[:max_len]
except Exception as e:
log.exception(e)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import numpy as np
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from reader import DeepwalkReader
import mp_reader
def get_file_list(path):
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path):
filelist = []
if os.path.isfile(edge_path):
filelist = [edge_path]
elif os.path.isdir(edge_path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(edge_path) for f in filenames
]
else:
raise ValueError(edge_path + " not supported")
edges, edge_weight = [], []
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
edges.append([slots[1], slots[0]])
if len(slots) > 2:
edge_weight.extend([float(slots[2]), float(slots[2])])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edges should be smaller then num_nodes!"
edge_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight)
graph = Graph(num_nodes, edges, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
del edges, edge_feat
log.info("Build graph index done")
if "weight" in graph.edge_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info("Build graph alias sample table done")
return graph
def build_fake_graph(num_nodes):
class FakeGraph():
pass
graph = FakeGraph()
graph.num_nodes = num_nodes
return graph
def build_gen_func(args, graph):
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None or args.walkpath_files == "None":
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None or args.train_files == "None":
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def test_gen_speed(gen_func):
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
# Distributed GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of indicators as reported in the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
For high scalability, we use Redis as the distributed graph storage solution and train GraphSAGE against the Redis server.
### Datasets (Quickstart)
The Reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J). The details of the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, the Reddit dataset has been preprocessed and packed into a docker image, which can be pulled with the following command.
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
- scipy
- redis==2.10.6
- redis-py-cluster==1.3.6
```
### How to run
#### 1. Start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
```
#### 2. Train the GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_workers 10
```
#### Hyperparameters
- epoch: Number of training epochs (default: 10).
- use_cuda: Use gpu if assign use_cuda.
- graphsage_type: We support 4 aggregator types including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- sample_workers: The number of workers for multiprocessing subgraph sampling.
- lr: Learning rate.
- batch_size: Batch size.
- samples_1: The max neighbors for the first hop neighbor sampling. (default: 25)
- samples_2: The max neighbors for the second hop neighbor sampling. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
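The hyperparameters above can be combined freely on the command line. A sketch of such a run (values are illustrative and not tuned; the reddit data service from step 1 must already be running):
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_maxpool \
    --sample_workers 10 --lr 0.01 --batch_size 128 \
    --samples_1 25 --samples_2 10 --hidden_size 128
```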
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def mean_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import socket
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
from pgl import redis_graph
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
redis_configs = [{
"host": socket.gethostbyname(socket.gethostname()),
"port": 7430
}, ]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
eid2edges = {}
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
for _dst, _srcs, _eids in zip(start_nodes, pred, pred_eid):
for _src, _eid in zip(_srcs, _eids):
eid2edges[_eid] = (_src, _dst)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
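# Build the sampled subgraph and remap the seed nodes into its local index space.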
subgraph = graph.subgraph(
nodes=nodes, eid=eids, edges=[eid2edges[e] for e in eids])
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
def multiprocess_graph_reader(graph_wrapper,
samples,
node_index,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
for data in rd():
feed_dict = data
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
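# Each worker processes a contiguous block of the pre-computed batches.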
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)],
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
return reader()
scipy
redis==2.10.6
redis-py-cluster==1.3.6
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label.npz is preprocessed from reddit.npz without the feats key.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data()
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=[('feats', [None, 602], np.dtype('float32'))])
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=1,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
......@@ -26,24 +26,25 @@ def gat_layer(graph_wrapper, node_feature, hidden_size):
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275 | 0.0253s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~83% |
| Pubmed | ~78% |
| Citeseer | ~70% |
### How to run
......
......@@ -68,7 +68,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -111,7 +111,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -121,7 +121,7 @@ def main(args):
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -132,7 +132,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
......@@ -26,18 +26,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~79% |
| Citeseer | ~71% |
### How to run
......
......@@ -70,7 +70,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -113,7 +113,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -123,7 +123,7 @@ def main(args):
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -134,7 +134,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
# PGL Examples for GES
[Graph Embedding with Side Information](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the GES algorithm.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## How to run
For example, to train GES on the BlogCatalog dataset:
```sh
# train GES with GPUs.
sh gpu_run.sh
```
## Hyperparameters
- dataset: The built-in dataset "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epochs.
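Besides `sh gpu_run.sh`, gpu_train.py can also be invoked directly. A minimal single-machine sketch with illustrative, untuned values (a GPU is required; flags follow gpu_train.py's argument parser):
```sh
python3 gpu_train.py --dataset "BlogCatalog" --epoch 100 --lr 0.025 \
    --neg_num 5 --walk_len 40 --win_size 5 --batch_size 128 \
    --hidden_size 128 --num_sample_workers 4 --output_path ./output
```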
#!/bin/bash
export FLAGS_sync_nccl_allreduce=1
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_fraction_of_gpu_memory_to_use=1
export NCCL_DEBUG=INFO
export NCCL_IB_GID_INDEX=3
export GLOG_v=1
export GLOG_logtostderr=1
num_nodes=10312
num_embedding=10351
num_sample_workers=20
# build train_data
rm -rf train_data && mkdir -p train_data
cd train_data
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/num_sample_workers+1))
cd -
python3 gpu_train.py --output_path ./output --epoch 100 --walk_len 40 --win_size 5 --neg_num 5 --batch_size 128 --hidden_size 128 \
--num_nodes $num_nodes --num_embedding $num_embedding --num_sample_workers $num_sample_workers --steps_per_save 2000 --dataset "BlogCatalog"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" gpu_train
"""
import argparse
import time
import os
import glob
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from pgl import data_loader
import mp_reader
from reader import GESReader
from model import GESModel
def get_file_list(path):
"""get_file_list
"""
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path, output_path, undigraph=True):
""" build_graph
"""
edge_file = os.path.join(output_path, "edge.npy")
edge_weight_file = os.path.join(output_path, "edge_weight.npy")
alias_file = os.path.join(output_path, "alias.npy")
events_file = os.path.join(output_path, "events.npy")
if os.path.isfile(edge_file):
edges = np.load(edge_file)
edge_feat = dict()
if os.path.isfile(edge_weight_file):
log.info("Loading weight from cache")
edge_feat["weight"] = np.load(edge_weight_file, allow_pickle=True)
node_feat = dict()
if os.path.isfile(alias_file):
log.info("Loading alias from cache")
node_feat["alias"] = np.load(alias_file, allow_pickle=True)
if os.path.isfile(events_file):
log.info("Loading events from cache")
node_feat["events"] = np.load(events_file, allow_pickle=True)
else:
filelist = get_file_list(edge_path)
edges, edge_weight = [], []
log.info("Reading edge files")
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
if len(slots) > 2:
edge_weight.append(slots[2])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edges should be smaller then num_nodes!"
log.info("Read edge files done.")
edge_feat = dict()
node_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight, dtype="float32")
if undigraph is True:
edges = np.concatenate([edges, edges[:, [1, 0]]], 0)
if "weight" in edge_feat:
edge_feat["weight"] = np.concatenate(
[edge_feat["weight"], edge_feat["weight"]],
0).astype("float64")
graph = Graph(num_nodes, edges, node_feat, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
log.info("Build graph index done")
if "weight" in graph.edge_feat and "alias" not in graph.node_feat and "events" not in graph.node_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info(
"Build graph alias sample table done, and saving alias & evnets cache"
)
np.save(alias_file, graph.node_feat["alias"])
np.save(events_file, graph.node_feat["events"])
return graph
def optimization(base_lr, loss, train_steps, optimizer='adam'):
""" optimization
"""
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
# don't use lazy mode on GPU
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def build_gen_func(args, graph, node_feat):
""" build_gen_func
"""
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None:
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None:
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
GESReader(
graph,
node_feat,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def get_parallel_exe(program, loss):
""" get_parallel_exe
"""
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = 1  # 2 for fp32, 4 for fp16
exec_strategy.use_experimental_executor = True
exec_strategy.num_iteration_per_drop_scope = 10  # drop scopes periodically to reduce memory usage
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
""" train
"""
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step % args.steps_per_save == 0 or
step == train_steps) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def test_gen_speed(gen_func):
""" test_gen_speed
"""
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
def main(args):
""" main
"""
import logging
log.setLevel(logging.DEBUG)
log.info("start")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
else:
raise ValueError(args.dataset + " dataset doesn't exists")
log.info("Load buildin BlogCatalog dataset done.")
node_feat = np.expand_dims(graph.node_feat["group_id"].argmax(-1),
-1) + graph.num_nodes
args.num_nodes = graph.num_nodes
args.num_embedding = graph.num_nodes + graph.node_feat[
"group_id"].shape[-1]
else:
graph = build_graph(args.num_nodes, args.edge_path, args.output_path)
node_feat = np.load(args.node_feat_npy)
model = GESModel(args.num_embedding, node_feat.shape[1] + 1,
args.hidden_size, args.neg_num, False, 2)
pyreader = model.pyreader
loss = model.forward()
num_devices = len(F.cuda_places())
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
log.info("Train steps: %s" % train_steps)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
gen_func = build_gen_func(args, graph, node_feat)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--num_embedding", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--node_feat_npy", type=str, default="./feat.npy")
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
GES model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class GESModel(object):
""" GESModel
"""
def __init__(self,
num_nodes,
num_featuers,
hidden_size=16,
neg_num=5,
is_sparse=False,
num_part=1):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, num_featuers, 1],
[-1, neg_num + 1, num_featuers, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.num_featuers = num_featuers
self.neg_num = neg_num
self.embed_init = F.initializer.TruncatedNormal(scale=1.0 /
math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.num_part = num_part
self.hidden_size = hidden_size
self.loss = None
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
src_embed = L.reduce_mean(src_embed, 2)
dst_embed = L.reduce_mean(dst_embed, 2)
logits = L.matmul(
src_embed, dst_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
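# Weight the single positive pair by neg_num so positives and negatives contribute equally to the loss.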
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
class EGESModel(GESModel):
""" EGESModel
"""
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
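# The first entry along the feature axis is the node id; it is sliced out below to look up the per-feature attention weights (alpha).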
src_id = L.slice(src, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, 1, 1, 1])
dst_id = L.slice(dst, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, self.neg_num + 1, 1, 1])
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
# [b, 1, f, h]
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
# [b, n+1, f, h]
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
# [b, 1, 1, f]
src_weight = L.softmax(
L.embedding(
src_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, n+1, 1, f]
dst_weight = L.softmax(
L.embedding(
dst_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, 1, h]
src_sum = L.squeeze(L.matmul(src_weight, src_embed), axes=[2])
# [b, n+1, h]
dst_sum = L.squeeze(L.matmul(dst_weight, dst_embed), axes=[2])
logits = L.matmul(
src_sum, dst_sum, transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
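# Register a fallback (de)serializer so arbitrary Python objects (e.g. feed dicts) can be shipped between processes via pyarrow.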
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader uses Python multiprocessing to read data from the given
readers and merges the results through a multiprocessing.Queue or a
multiprocessing.Pipe. One process is started per input reader, and each
process calls exactly one reader.
multiprocessing.Queue requires read/write access to /dev/shm, which some
platforms do not support.
You need to create the readers first; they should be independent of each
other so that every process can work on its own data.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader1 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
sample = deserialize_data(buff)
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
class GESReader(object):
""" GESReader
"""
def __init__(self,
graph,
node_feat,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
walkpath_files: if not None, read walk paths from walkpath_files
"""
self.graph = graph
self.node_feat = node_feat
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
""" walk_from_files
"""
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
""" walk_from_graph
"""
def node_generator():
""" node_generator
"""
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
""" walk_generator
"""
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
src, pos = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src.extend(s)
pos.extend(p)
src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos, [-1, 1, 1])
if src.shape[0] == 0:
continue
neg_sample_size = [len(pos), self.neg_num, 1]
if self.neg_sample_type == "average":
negs = self.graph.sample_nodes(neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
dst = np.concatenate([pos, negs], 1)
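# Prepend each node id to its side-information features; the model embeds ids and features jointly.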
src_feat = np.concatenate([src, self.node_feat[src[:, :, 0]]], -1)
dst_feat = np.concatenate([dst, self.node_feat[dst[:, :, 0]]], -1)
src_feat, dst_feat = np.expand_dims(src_feat, -1), np.expand_dims(
dst_feat, -1)
yield src_feat[:max_len], dst_feat[:max_len]
......@@ -12,17 +12,23 @@ The reddit dataset should be downloaded from the following links and placed in d
### Dependencies
- sklearn
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### How to run
To train a GraphSAGE model on Reddit Dataset, you can just run
```
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
If you want to train a GraphSAGE model with multiple GPUs, you can just run
```
CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2
```
#### Hyperparameters
- epoch: Number of epochs default (10)
......
......@@ -17,12 +17,15 @@ import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
import train
import time
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
......@@ -33,6 +36,8 @@ def node_batch_iter(nodes, node_label, batch_size):
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
......@@ -42,13 +47,21 @@ def traverse(item):
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph, samples):
def worker(batch_info, graph, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
......@@ -65,11 +78,14 @@ def worker(batch_info, graph, samples):
if len(start_nodes) == 0:
break
feed_dict = {}
feed_dict["nodes"] = [int(n) for n in nodes]
feed_dict["eids"] = [int(e) for e in eids]
feed_dict["node_label"] = [int(n) for n in batch_train_labels]
feed_dict["node_index"] = [int(n) for n in batch_train_samples]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
......@@ -82,26 +98,28 @@ def multiprocess_graph_reader(graph,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
last = time.time()
for data in rd():
nodes = data["nodes"]
eids = data["eids"]
batch_train_labels = data["node_label"]
batch_train_samples = data["node_index"]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
this = time.time()
feed_dict = data
now = time.time()
last = now
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
......@@ -110,9 +128,9 @@ def multiprocess_graph_reader(graph,
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)], graph,
samples))
multi_process_sample = paddle.reader.multiprocess_reader(
reader_pool, use_pipe=False)
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
......@@ -121,7 +139,10 @@ def multiprocess_graph_reader(graph,
def graph_reader(graph, graph_wrapper, samples, node_index, batch_size,
node_label):
"""graph_reader"""
def reader():
"""reader"""
for batch_train_samples, batch_train_labels in node_batch_iter(
node_index, node_label, batch_size=batch_size):
start_nodes = batch_train_samples
......
......@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
......@@ -34,8 +35,9 @@ def load_data(normalize=True, symmetry=True):
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
......@@ -64,7 +66,7 @@ def load_data(normalize=True, symmetry=True):
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"index": np.arange(
0, len(feature), dtype="int32")})
0, len(feature), dtype="int64")})
return {
"graph": graph,
......@@ -82,7 +84,7 @@ def load_data(normalize=True, symmetry=True):
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int32", append_batch_size=False)
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
......@@ -198,7 +200,9 @@ def main(args):
hide_batch_size=False)
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", place, node_feat=data['graph'].node_feat_info())
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import sys
import traceback
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data(normalize=True, symmetry=True):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row
dst = adj.col
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"feat": feature.astype("float32")})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
"""build_graph_model"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
feature = graph_wrapper.node_feat["feat"]
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
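# Note: softmax_with_cross_entropy expects raw logits and applies softmax
# internally; `proba` above is only used for the accuracy metric.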
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def to_multidevice(batch_iter, num_trainer):
"""to_multidevice"""
batch_dict = []
for batch in batch_iter():
batch_dict.append(batch)
if len(batch_dict) == num_trainer:
yield batch_dict
batch_dict = []
if len(batch_dict) > 0:
log.warning("The batch (%s) can't fill all device (%s)"
"which will be discarded." %
(len(batch_dict), num_trainer))
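# Example: with num_trainer=2, batches [b1, b2, b3, b4, b5] are regrouped
# into [b1, b2] and [b3, b4]; the leftover b5 is dropped with a warning,
# because the ParallelExecutor path expects exactly one feed dict per device.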
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100,
num_trainer=1):
"""run_epoch"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
if num_trainer > 1:
batch_iter = to_multidevice(batch_iter, num_trainer)
else:
batch_iter = batch_iter()
for batch_feed_dict in batch_iter:
batch += 1
if num_trainer > 1:
batch_loss, batch_acc = exe.run(
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
batch_loss = np.mean(batch_loss)
batch_acc = np.mean(batch_acc)
else:
batch_loss, batch_acc = exe.run(
program,
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
if num_trainer > 1:
num_samples = sum(
[len(b["node_index"]) for b in batch_feed_dict])
else:
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""main"""
data = load_data(args.normalize, args.symmetry)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
if args.num_trainer > 1:
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
build_strategy.enable_sequential_execution = True
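# With num_trainer > 1, ParallelExecutor replicates train_program across the
# visible GPUs; run_epoch then feeds it a list with one feed dict per device
# (built by to_multidevice) and averages the per-device loss and accuracy.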
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=train_program,
build_strategy=build_strategy,
loss_name=model_loss.name)
else:
train_exe = exe
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=train_exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
num_trainer=args.num_trainer,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--num_trainer", type=int, default=1)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Multi-GPU settings
"""
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def fixed_offset(data, num_nodes, scale):
"""Test
"""
len_data = len(data)
len_per_part = int(len_data / scale)
offset = np.arange(0, scale, dtype="int64")
offset = offset * num_nodes
offset = np.repeat(offset, len_per_part)
if len(data.shape) > 1:
data += offset.reshape([-1, 1])
else:
data += offset
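# Example: with num_nodes=3 and scale=2, an index array [0, 2, 0, 2] becomes
# [0, 2, 3, 5]; the second copy of the graph is shifted by num_nodes so the
# replicated node ids stay disjoint.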
def load_data(normalize=True, symmetry=True, scale=1):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row.reshape([-1, 1])
dst = adj.col.reshape([-1, 1])
edges = np.hstack([src, dst])
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
if scale > 1:
num_nodes = feature.shape[0]
feature = np.tile(feature, [scale, 1])
train_label = np.tile(train_label, [scale])
val_label = np.tile(val_label, [scale])
test_label = np.tile(test_label, [scale])
edges = np.tile(edges, [scale, 1])
fixed_offset(edges, num_nodes, scale)
train_index = np.tile(train_index, [scale])
fixed_offset(train_index, num_nodes, scale)
val_index = np.tile(val_index, [scale])
fixed_offset(val_index, num_nodes, scale)
test_index = np.tile(test_index, [scale])
fixed_offset(test_index, num_nodes, scale)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=edges,
node_feat={
"index": np.arange(
0, len(feature), dtype="int64"),
"feature": feature
})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"feature": feature,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
"""Test"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s % i")
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s % i")
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
"""Test"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""Test """
data = load_data(args.normalize, args.symmetry, args.scale)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
feature=graph_wrapper.node_feat["feature"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
parser.add_argument("--scale", type=int, default=1)
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for LINE
[LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable for a variety of networks with directed, undirected, binary or weighted edges. Based on PGL, we reproduce the LINE algorithms and match the results reported in the paper.
## Datasets
[Flickr network](http://socialnetworks.mpi-sws.org/data-imc2007.html) is a social network with 1,715,256 nodes and 22,613,981 edges.
You can download the data from [here](http://socialnetworks.mpi-sws.org/data-imc2007.html).
The Flickr dataset contains four files:
* flickr-groupmemberships.txt.gz
* flickr-groups.txt.gz
* flickr-links.txt.gz
* flickr-users.txt.gz
After downloading the data, uncompress the four files into **./data/flickr/**. Note that the commands below assume the root directory of the LINE model as the working directory.
Then run the following command to preprocess the data.
```sh
python data_process.py
```
It will produce three files in the **./data/flickr/** directory:
* nodes.txt
* edges.txt
* nodes_label.txt
## Dependencies
- paddlepaddle>=1.6
- pgl
## How to run
For example, use GPU to train LINE on the Flickr dataset:
```sh
# multiclass task example
python line.py --use_cuda --order first_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/
python multi_class.py --ckpt_path ./checkpoints/model/model_eopch_20 --percent 0.5
```
## Hyperparameters
- use_cuda: Use GPU for training if --use_cuda is set.
- order: Train LINE with first-order or second-order proximity (a second-order example is sketched below).
- percent: The percentage of labeled nodes used as training data in the multi-class task.
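For instance, a second-order run could look like the sketch below; it assumes `--order` accepts `second_order` analogously to `first_order`, and the checkpoint path simply mirrors the example above.

```sh
# train with second-order proximity, then evaluate the embeddings
python line.py --use_cuda --order second_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/
python multi_class.py --ckpt_path ./checkpoints/model/model_eopch_20 --percent 0.9
```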
### Experiment results
Dataset|Model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
Flickr|LINE with first_order|multi-label classification|MacroF1|0.626|0.627
Flickr|LINE with first_order|multi-label classification|MicroF1|0.637|0.639
Flickr|LINE with second_order|multi-label classification|MacroF1|0.615|0.621
Flickr|LINE with second_order|multi-label classification|MicroF1|0.630|0.635
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the Dataset for LINE model.
"""
import os
import io
import sys
import numpy as np
from pgl import graph
from pgl.utils.logger import log
class FlickrDataset(object):
"""Flickr dataset implementation
Args:
name: The name of the dataset.
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to contain self loop edges.
train_percentage: The percentage of nodes to be trained in multi class task.
Attributes:
graph: The :code:`Graph` data object.
num_groups: Number of classes.
train_index: The index for nodes in training set.
test_index: The index for nodes in validation set.
"""
def __init__(self,
data_path,
symmetry_edges=False,
self_loop=False,
train_percentage=0.5):
self.path = data_path
# self.name = name
self.num_groups = 5
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self.train_percentage = train_percentage
self._load_data()
def _load_data(self):
edge_path = os.path.join(self.path, 'edges.txt')
node_path = os.path.join(self.path, 'nodes.txt')
nodes_label_path = os.path.join(self.path, 'nodes_label.txt')
all_edges = []
edges_weight = []
with io.open(node_path) as inf:
num_nodes = len(inf.readlines())
node_feature = np.zeros((num_nodes, self.num_groups))
with io.open(nodes_label_path) as inf:
for line in inf:
# group_id means the label of the node
node_id, group_id = line.strip('\n').split(',')
node_id = int(node_id) - 1
labels = group_id.split(' ')
for i in labels:
node_feature[node_id][int(i) - 1] = 1
node_degree_list = [1 for _ in range(num_nodes)]
with io.open(edge_path) as inf:
for line in inf:
items = line.strip().split('\t')
if len(items) == 2:
u, v = int(items[0]), int(items[1])
weight = 1 # binary weight, default set to 1
else:
u, v, weight = int(items[0]), int(items[1]), float(items[2])
u, v = u - 1, v - 1
all_edges.append((u, v))
edges_weight.append(weight)
if self.symmetry_edges:
all_edges.append((v, u))
edges_weight.append(weight)
# sum the weights of the same node as the outdegree
node_degree_list[u] += weight
if self.self_loop:
for i in range(num_nodes):
all_edges.append((i, i))
edges_weight.append(1.)
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=num_nodes,
edges=all_edges,
node_feat={"group_id": node_feature})
perm = np.arange(0, num_nodes)
np.random.shuffle(perm)
train_num = int(num_nodes * self.train_percentage)
self.train_index = perm[:train_num]
self.test_index = perm[train_num:]
edge_distribution = np.array(edges_weight, dtype=np.float32)
self.edge_distribution = edge_distribution / np.sum(edge_distribution)
self.edge_sampling = AliasSampling(prob=edge_distribution)
node_dist = np.array(node_degree_list, dtype=np.float32)
node_negative_distribution = np.power(node_dist, 0.75)
self.node_negative_distribution = node_negative_distribution / np.sum(
node_negative_distribution)
self.node_sampling = AliasSampling(prob=node_negative_distribution)
self.node_index = {}
self.node_index_reversed = {}
for index, e in enumerate(self.graph.edges):
self.node_index[e[0]] = index
self.node_index_reversed[index] = e[0]
def fetch_batch(self,
batch_size=16,
K=10,
edge_sampling='alias',
node_sampling='alias'):
"""Fetch batch data from dataset.
"""
if edge_sampling == 'numpy':
edge_batch_index = np.random.choice(
self.graph.num_edges,
size=batch_size,
p=self.edge_distribution)
elif edge_sampling == 'alias':
edge_batch_index = self.edge_sampling.sampling(batch_size)
elif edge_sampling == 'uniform':
edge_batch_index = np.random.randint(
0, self.graph.num_edges, size=batch_size)
u_i = []
u_j = []
label = []
for edge_index in edge_batch_index:
edge = self.graph.edges[edge_index]
u_i.append(edge[0])
u_j.append(edge[1])
label.append(1)
for i in range(K):
while True:
if node_sampling == 'numpy':
negative_node = np.random.choice(
self.graph.num_nodes,
p=self.node_negative_distribution)
elif node_sampling == 'alias':
negative_node = self.node_sampling.sampling()
elif node_sampling == 'uniform':
negative_node = np.random.randint(0,
self.graph.num_nodes)
# make sure the sampled node has no edge with the source node
if not self.graph.has_edges_between(
np.array(
[self.node_index_reversed[negative_node]]),
np.array([self.node_index_reversed[edge[0]]])):
break
u_i.append(edge[0])
u_j.append(negative_node)
label.append(-1)
u_i = np.array([u_i], dtype=np.int64).T
u_j = np.array([u_j], dtype=np.int64).T
label = np.array(label, dtype=np.float32)
return u_i, u_j, label
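# Each call yields batch_size * (1 + K) pairs: one positive pair (label +1)
# per sampled edge plus K negative pairs (label -1); u_i and u_j have shape
# [batch_size * (1 + K), 1] and label has shape [batch_size * (1 + K)].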
class AliasSampling:
"""Implemention of Alias-Method
This is an implementation of Alias-Method for sampling efficiently from
a discrete probability distribution.
Reference: https://en.wikipedia.org/wiki/Alias_method
Args:
prob: The discrete probability distribution.
"""
def __init__(self, prob):
self.n = len(prob)
self.U = np.array(prob) * self.n
self.K = [i for i in range(len(prob))]
overfull, underfull = [], []
for i, U_i in enumerate(self.U):
if U_i > 1:
overfull.append(i)
elif U_i < 1:
underfull.append(i)
while len(overfull) and len(underfull):
i, j = overfull.pop(), underfull.pop()
self.K[j] = i
self.U[i] = self.U[i] - (1 - self.U[j])
if self.U[i] > 1:
overfull.append(i)
elif self.U[i] < 1:
underfull.append(i)
def sampling(self, n=1):
"""Sampling.
"""
x = np.random.rand(n)
i = np.floor(self.n * x)
y = self.n * x - i
i = i.astype(np.int64)
res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)]
if n == 1:
return res[0]
else:
return res
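# Minimal usage sketch (illustrative probabilities):
# sampler = AliasSampling(prob=[0.1, 0.2, 0.3, 0.4])
# sampler.sampling()      # a single index in [0, 4)
# sampler.sampling(n=5)   # a list of 5 indices, each drawn in O(1)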
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file preprocess the FlickrDataset for LINE model.
"""
import argparse
import operator
import os
def process_data(groupsMemberships_file, flickr_links_file, users_label_file,
edges_file, users_file):
"""Preprocess flickr network dataset.
Args:
groupsMemberships_file: flickr-groupmemberships.txt file,
each line is a pair (user, group), which indicates a user belongs to a group.
flickr_links_file: flickr-links.txt file,
each line is a pair (user, user), which indicates
the two users have a relationship.
users_label_file: each line is a pair (user, list of group),
each user may belong to multiple groups.
edges_file: each line is a pair (user, user), which indicates
the two users have a relationship. It filters some unused edges.
users_file: each line is a int number, which indicates the ID of a user.
"""
group2users = {}
with open(groupsMemberships_file, 'r') as f:
for line in f:
user, group = line.strip().split()
try:
group2users[int(group)].append(user)
except:
group2users[int(group)] = [user]
# counting how many users belong to every group
group2usersNum = {}
for key, item in group2users.items():
group2usersNum[key] = len(item)
groups_sorted_by_usersNum = sorted(
group2usersNum.items(), key=operator.itemgetter(1), reverse=True)
# the paper only need the 5 groups with the largest number of users
label = 1 # remapping the 5 groups from 1 to 5
users_label = {}
for i in range(5):
users_list = group2users[groups_sorted_by_usersNum[i][0]]
for user in users_list:
# one user may have multi-labels
try:
users_label[user].append(label)
except:
users_label[user] = [label]
label += 1
# remapping the users IDs to make the IDs from 0 to N
userID2nodeID = {}
count = 1
for key in sorted(users_label.keys()):
userID2nodeID[key] = count
count += 1
with open(users_label_file, 'w') as writer:
for key in sorted(users_label.keys()):
line = ' '.join([str(i) for i in users_label[key]])
writer.write(str(userID2nodeID[key]) + ',' + line + '\n')
# produce edges file
with open(flickr_links_file, 'r') as reader, open(edges_file,
'w') as writer:
for line in reader:
src, dst = line.strip().split('\t')
# filter unused user IDs
if src in users_label and dst in users_label:
# remapping the users IDs
src = userID2nodeID[src]
dst = userID2nodeID[dst]
writer.write(str(src) + '\t' + str(dst) + '\n')
# produce nodes file
with open(users_file, 'w') as writer:
for i in range(1, 1 + len(userID2nodeID)):
writer.write(str(i) + '\n')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='LINE')
parser.add_argument(
'--groupmemberships',
type=str,
default='./data/flickr/flickr-groupmemberships.txt',
help='groupmemberships of flickr dataset')
parser.add_argument(
'--flickr_links',
type=str,
default='./data/flickr/flickr-links.txt',
help='the flickr-links.txt file for training')
parser.add_argument(
'--nodes_label',
type=str,
default='./data/flickr/nodes_label.txt',
help='nodes (users) label file for training')
parser.add_argument(
'--edges',
type=str,
default='./data/flickr/edges.txt',
help='the result edges (links) file for training')
parser.add_argument(
'--nodes',
type=str,
default='./data/flickr/nodes.txt',
help='the nodes (users) file for training')
args = parser.parse_args()
process_data(args.groupmemberships, args.flickr_links, args.nodes_label,
args.edges, args.nodes)
......@@ -13,8 +13,9 @@
# limitations under the License.
"""Generate pgl apis
"""
__version__ = "0.1.0.beta"
__version__ = "1.0.0"
from pgl import layers
from pgl import graph_wrapper
from pgl import graph
from pgl import data_loader
from pgl import contrib