# Paddle Graph Learning (PGL) 

[DOC](https://pgl.readthedocs.io/en/latest/) | [Tutorial](https://pgl.readthedocs.io/en/latest/examples/gcn_examples.html)

Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle). 


![The Framework of Paddle Graph Learning (PGL)](https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/framework_of_pgl.png)


We provide Python interfaces for storing, reading, and querying graph-structured data, along with two fundamental computational interfaces for building cutting-edge graph learning algorithms: a walk-based paradigm and a message-passing-based paradigm, as shown in the framework figure above. Combined with the PaddlePaddle deep learning framework, PGL supports both graph representation learning models and graph neural networks, and therefore covers a wide range of graph-based applications.


## Highlight: Efficient and Flexible Message Passing Paradigm

One of the most important advantages of graph neural networks over other models is their ability to use node-to-node connectivity information, but coding the communication between nodes by hand is cumbersome. PGL adopts a **message passing paradigm** similar to [DGL](https://github.com/dmlc/dgl) to make it easy to build customized graph neural networks. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function $\phi^e$ to send a message from the source node to the target node. In the second step, the recv function $\phi^v$ aggregates the messages arriving from different sources with an aggregate function $\oplus$.

![The basic idea of message passing paradigm](https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/message_passing_paradigm.png)
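
The two steps can be sketched in plain Python without any framework. This is only an illustration of the paradigm, not PGL's API: the function ```message_passing``` and its arguments are hypothetical names, the send function simply copies the source feature, and the aggregate $\oplus$ is assumed to be a sum.

```python
from collections import defaultdict


def send(src_feat, dst_feat, edge_feat):
    """phi^e: build the message carried by one edge (here: copy the source feature)."""
    return src_feat


def recv(messages):
    """phi^v with sum as the aggregate: combine all messages that reached one node."""
    return [sum(column) for column in zip(*messages)]


def message_passing(edges, node_feat, send_fn, recv_fn):
    """Hypothetical helper: run one send/recv round over a list of (src, dst) edges."""
    inbox = defaultdict(list)
    for src, dst in edges:
        # Step 1: evaluate the send function on every edge.
        inbox[dst].append(send_fn(node_feat[src], node_feat[dst], None))
    dim = len(node_feat[0])
    # Step 2: aggregate per target node; nodes without incoming edges get zeros.
    return [recv_fn(inbox[n]) if inbox[n] else [0.0] * dim
            for n in range(len(node_feat))]


edges = [(0, 1), (2, 1), (1, 0)]
node_feat = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = message_passing(edges, node_feat, send, recv)
print(out)  # [[3.0, 4.0], [6.0, 8.0], [0.0, 0.0]]
```

Node 1 receives messages from nodes 0 and 2 and sums them, node 0 receives only the message from node 1, and node 2 has no incoming edge.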


As shown on the left of the following figure, to support general user-defined message aggregate functions, DGL uses degree bucketing: it combines nodes with the same degree into a batch and then applies the aggregate function $\oplus$ to each batch serially. For PGL's user-defined (UDF) aggregate functions, we organize the messages as a [LoDTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages for each node as a variable-length sequence. We then **utilize the features of LoDTensor in Paddle to obtain fast parallel aggregation**.


![The parallel degree bucketing of PGL](https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/parallel_degree_bucketing.png)
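
To make the difference concrete, here is a minimal pure-Python sketch of the two layouts. It is not code from PGL or DGL: the helper names ```build_lod```, ```aggregate_lod``` and ```degree_buckets``` are hypothetical, and the messages are assumed to be already sorted by target node. The offsets play the role of the level-of-detail information: every node owns one contiguous, variable-length segment of the flat message list.

```python
from collections import defaultdict


def build_lod(message_targets, num_nodes):
    """Compute segment offsets, assuming messages are sorted by target node."""
    counts = [0] * num_nodes
    for dst in message_targets:
        counts[dst] += 1
    offsets = [0]
    for count in counts:
        offsets.append(offsets[-1] + count)
    return offsets


def aggregate_lod(flat_msgs, offsets, reduce_fn):
    """Reduce every segment on its own. The segments do not depend on each
    other, so a real kernel could process all of them in parallel."""
    return [reduce_fn(flat_msgs[offsets[i]:offsets[i + 1]])
            for i in range(len(offsets) - 1)]


def degree_buckets(offsets):
    """Group node ids by in-degree, which is what degree bucketing batches on."""
    buckets = defaultdict(list)
    for node in range(len(offsets) - 1):
        buckets[offsets[node + 1] - offsets[node]].append(node)
    return dict(buckets)


message_targets = [0, 1, 1, 2, 2, 2, 3]          # target node of each message
flat_msgs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # one scalar message per edge
offsets = build_lod(message_targets, num_nodes=4)

print(offsets)                                 # [0, 1, 3, 6, 7]
print(aggregate_lod(flat_msgs, offsets, sum))  # [1.0, 5.0, 15.0, 7.0]
print(degree_buckets(offsets))                 # {1: [0, 3], 2: [1], 3: [2]}
```

With degree bucketing, the four nodes fall into three buckets that are handled one after another, while the offset-based layout describes all four segments in a single structure.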



Users only need to call the ```sequence_ops``` functions provided by Paddle to implement efficient message aggregation. For example, ```sequence_pool``` can be used to sum the neighbor messages:
```python
import paddle.fluid as fluid

def recv(msg):
    # msg is a LoDTensor holding one variable-length sequence of messages per node.
    return fluid.layers.sequence_pool(msg, "sum")
```


DGL applies kernel fusion with scatter-gather to optimize general aggregate functions such as sum and max. For **complex user-defined functions**, however, it falls back to the degree bucketing algorithm, and executing each degree bucket serially cannot take full advantage of the GPU. In contrast, operations on PGL's LoDTensor-based messages are performed in parallel and can fully utilize GPU parallelism. Even without scatter-gather optimization, PGL still delivers competitive performance. We also provide built-in scatter-optimized message aggregation functions.
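
As an illustration of what a "complex user-defined function" might look like, the sketch below applies a softmax-weighted sum to each node's messages. It is plain Python, not PGL code: ```segment_apply``` and ```softmax_weighted_sum``` are hypothetical names, and the offsets are assumed to mark each node's segment in a flat message list. The point is that such a function is neither a plain sum nor a plain max, so it cannot reuse a fused sum/max kernel, yet each segment can still be reduced independently of the others.

```python
import math


def segment_apply(flat_msgs, offsets, fn):
    """Apply fn to each node's segment of the flat message list."""
    return [fn(flat_msgs[offsets[i]:offsets[i + 1]])
            for i in range(len(offsets) - 1)]


def softmax_weighted_sum(segment):
    """Hypothetical UDF: weight each message by the softmax of its value, then sum."""
    if not segment:
        return 0.0  # nodes without incoming messages
    peak = max(segment)  # subtract the maximum for numerical stability
    weights = [math.exp(m - peak) for m in segment]
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, segment)) / total


flat_msgs = [2.0, 2.0, 0.0, 1.0, 3.0]
offsets = [0, 2, 3, 3, 5]  # node 2 has no incoming messages
result = segment_apply(flat_msgs, offsets, softmax_weighted_sum)
```

Here ```result``` holds one value per node; the last node's value lies between its two messages and closer to the larger one.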

## Performance


We test all the GNN algorithms on a Tesla V100-SXM2-16G, running each for 200 epochs to obtain the average epoch time. We report accuracy on the test set without early stopping.

| Dataset | Model | PGL accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
| -------- | ----- | ------------ | ---------------------- | ---------------------- |
| Cora | GCN | 81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
| Pubmed | GCN | 79.2% | **0.0049s** | 0.0051s |
| Pubmed | GAT | 77% | 0.0193s | **0.0144s** |
| Citeseer | GCN | 70.2% | **0.0045s** | 0.0046s |
| Citeseer | GAT | 68.8% | **0.0124s** | 0.0139s |
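
For readers who want to take a comparable measurement, an average epoch time can be obtained along the following lines. This is a generic Python 3 sketch and not the benchmark script behind the table: ```average_epoch_time``` is a hypothetical helper, and ```train_one_epoch``` stands for whatever callable runs one epoch of your model.

```python
import time


def average_epoch_time(train_one_epoch, num_epochs=200):
    """Run the epoch callable num_epochs times and return mean seconds per epoch."""
    start = time.perf_counter()
    for _ in range(num_epochs):
        train_one_epoch()
    return (time.perf_counter() - start) / num_epochs


# Stand-in workload so the sketch runs without a model or a GPU.
calls = []
seconds = average_epoch_time(lambda: calls.append(1), num_epochs=5)
print("average epoch time: %.6fs" % seconds)
```

When timing GPU training, make sure each epoch has actually finished (for example by fetching the loss value) before the clock is read, otherwise asynchronous execution can make the epochs look faster than they are.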

## System requirements

PGL requires:

* paddle >= 1.5
* networkx 


PGL supports both Python 2 and Python 3.


## Installation

pip install pgl




## The Team

PGL is developed and maintained by the NLP and Paddle teams at Baidu.

## License

PGL is released under the Apache License 2.0.