未验证 提交 0f357b0b 编写于 作者: K kirayummy 提交者: GitHub

Merge pull request #5 from Yelrose/master

Modified README.md
...@@ -33,14 +33,14 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e ...@@ -33,14 +33,14 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
``` ```
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions. Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
## Performance ## Performance
We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping. We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) | | Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ | | -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** | | Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s | | Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
...@@ -49,6 +49,16 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t ...@@ -49,6 +49,16 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Citeseer | GCN |70.2%| **0.0045** |0.0046s| | Citeseer | GCN |70.2%| **0.0045** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s| | Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with LSTM ignoring the order of recieved messages, the optimized message-passing in DGL will be forced to degenerate into degree bucketing scheme. The speed performance will be much slower than the one implemented in PGL. Performances may be various with different scale of the graph, in our experiments, PGL can reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## System requirements ## System requirements
PGL requires: PGL requires:
......
...@@ -28,7 +28,7 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd ...@@ -28,7 +28,7 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd
return fluid.layers.sequence_pool(msg, "sum") return fluid.layers.sequence_pool(msg, "sum")
``` ```
尽管DGL用了一些内核融合(kernel fusion)的方法来将常用的sum,max等聚合函数用scatter-gather进行优化。但是对于**复杂的用户定义函数**,他们使用的Degree Bucketing算法,仅仅使用串行的方案来处理不同的分块,并不同充分利用GPU进行加速。然而,在PGL中我们使用基于LodTensor的消息传递能够充分地利用GPU的并行优化,即使不使用scatter-gather的优化,PGL仍然有高效的性能表现。当然,我们也是提供了scatter优化的聚合函数。 尽管DGL用了一些内核融合(kernel fusion)的方法来将常用的sum,max等聚合函数用scatter-gather进行优化。但是对于**复杂的用户定义函数**,他们使用的Degree Bucketing算法,仅仅使用串行的方案来处理不同的分块,并不同充分利用GPU进行加速。然而,在PGL中我们使用基于LodTensor的消息传递能够充分地利用GPU的并行优化,在复杂的用户定义函数下,PGL的速度在我们的实验中甚至能够达到DGL的13倍。即使不使用scatter-gather的优化,PGL仍然有高效的性能表现。当然,我们也是提供了scatter优化的聚合函数。
## 性能测试 ## 性能测试
...@@ -36,7 +36,7 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd ...@@ -36,7 +36,7 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd
我们用Tesla V100-SXM2-16G测试了下列所有的GNN算法,每一个算法跑了200个Epoch来计算平均速度。准确率是在测试集上计算出来的,并且我们没有使用Early-stopping策略。 我们用Tesla V100-SXM2-16G测试了下列所有的GNN算法,每一个算法跑了200个Epoch来计算平均速度。准确率是在测试集上计算出来的,并且我们没有使用Early-stopping策略。
| 数据集 | 模型 | PGL准确率 | PGL速度 (epoch) | DGL速度 (epoch) | | 数据集 | 模型 | PGL准确率 | PGL速度 (epoch) | DGL 0.3.0 速度 (epoch) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ | | -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** | | Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s | | Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
...@@ -45,6 +45,15 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd ...@@ -45,6 +45,15 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd
| Citeseer | GCN |70.2%| **0.0045** |0.0046s| | Citeseer | GCN |70.2%| **0.0045** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s| | Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
如果我们使用复杂的用户定义聚合函数,例如像[GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf)那样忽略邻居信息的获取顺序,利用LSTM来聚合节点的邻居特征。DGL所使用的消息传递函数将退化成Degree Bucketing模式,在这个情况下DGL实现的模型会比PGL的慢的多。模型的性能会随着图规模而变化,在我们的实验中,PGL的速度甚至能够能达到DGL的13倍。
| 数据集 | PGL速度 (epoch) | DGL 0.3.0 速度 (epoch time) | 加速比 |
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## 依赖 ## 依赖
PGL依赖于: PGL依赖于:
......
...@@ -35,8 +35,8 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e ...@@ -35,8 +35,8 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
return fluid.layers.sequence_pool(msg, "sum") return fluid.layers.sequence_pool(msg, "sum")
``` ```
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
## Performance ## Performance
...@@ -50,3 +50,11 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t ...@@ -50,3 +50,11 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**| | Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045** |0.0046s| | Citeseer | GCN |70.2%| **0.0045** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s| | Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with LSTM ignoring the order of recieved messages, the optimized message-passing in DGL will be forced to degenerate into degree bucketing scheme. The speed performance will be much slower than the one implemented in PGL. Performances may be various with different scale of the graph, in our experiments, PGL can reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册