Commit dad7ec3a authored by: Y yelrose

init commit

Parent cd3076a6
@@ -186,7 +186,7 @@
       same "printed page" as the copyright notice for easier
       identification within third-party archives.
-      Copyright [yyyy] [name of copyright owner]
+      Copyright 2019 PaddlePaddle Authors.
       Licensed under the Apache License, Version 2.0 (the "License");
       you may not use this file except in compliance with the License.
# Paddle Graph Learning (PGL)
[API](https://xx) | [Tutorials](https://xx)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<div align=center><img src="framework_of_pgl.png" width="700"></div>
<center>The Framework of Paddle Graph Learning (PGL)</center>
We provide Python interfaces for storing/reading/querying graph-structured data, as well as two fundamental computational interfaces, the walk-based paradigm and the message-passing paradigm shown in the framework above, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, giving the framework a wide range of graph-based applications.
## Highlight: Efficient and Flexible Message Passing Paradigm
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. In PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help users build customized graph neural networks easily. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function $\phi^e$ to send a message from the source to the target node. In the second step, the recv function $\phi^v$ is responsible for aggregating ($\oplus$) the messages from different sources.
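For instance, a minimal sketch of these two functions for a plain sum aggregation looks as follows; the ```gw.send```/```gw.recv``` call pattern follows the quick-start example later in this document.

```python
import paddle.fluid as fluid

# phi^e: runs on every edge and decides what the source node sends to the target.
def send_func(src_feat, dst_feat, edge_feat):
    return src_feat["h"]

# phi^v: runs per destination node and aggregates (here: sums) the incoming messages.
def recv_func(msg):
    return fluid.layers.sequence_pool(msg, "sum")

# The two functions are wired together through a GraphWrapper `gw`, e.g.
#   msg = gw.send(send_func, nfeat_list=[("h", feature)])
#   out = gw.recv(msg, recv_func)
```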
<div align=center><img src="MPP.png" width="700"></div>
<center>The basic idea of the message passing paradigm</center>
As shown on the left of the following figure, to support general user-defined message aggregation functions, DGL uses a degree-bucketing method that groups nodes with the same degree into a batch and then applies the aggregation function $\oplus$ to each batch serially. For PGL's user-defined aggregation functions, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<div align=center><img src="parallel_degree_bucketing.png" width="750"></div>
<center>The parallel degree bucketing of PGL</center>
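To make the LodTensor layout concrete (the numbers below are made up): if destination nodes 0, 1 and 2 receive 2, 3 and 1 messages respectively, all messages are packed back-to-back into one tensor and an offset array marks each node's segment, so a sequence op can reduce every segment independently.

```python
import numpy as np

# 6 messages of dimension 4, packed back-to-back for all destination nodes.
messages = np.random.randn(6, 4).astype("float32")

# LoD offsets: node 0 owns rows [0, 2), node 1 owns [2, 5), node 2 owns [5, 6).
lod = [0, 2, 5, 6]

# A sequence "sum" pool reduces every variable-length segment on its own;
# this is what fluid.layers.sequence_pool(msg, "sum") does on the LodTensor.
aggregated = np.stack([messages[lod[i]:lod[i + 1]].sum(axis=0)
                       for i in range(len(lod) - 1)])
print(aggregated.shape)  # (3, 4): one aggregated message per destination node
```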
Users only need to call the ```sequence_ops``` functions provided by Paddle to implement efficient message aggregation. For example, ```sequence_pool``` can be used to sum the messages from neighbors.
```python
import paddle.fluid as fluid
def recv(msg):
return fluid.layers.sequence_pool(msg, "sum")
```
DGL does provide kernel-fusion (scatter-gather) optimizations for common aggregations such as sum and max. However, for **complex user-defined functions** the degree-bucketing algorithm executes each degree bucket serially, which cannot take full advantage of the GPU. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which fully utilizes GPU parallelism. Even without scatter-gather optimization, PGL still delivers excellent performance. Of course, we also provide built-in scatter-optimized message aggregation functions.
## Performance
We test all the GNN algorithms on a Tesla V100-SXM2-16G, running for 200 epochs to get the average speed, and we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
| Pubmed | GCN |79.2% |**0.0049s** |0.0051s |
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
## System requirements
PGL requires:
* paddle >= 1.5
* networkx
PGL supports both Python 2 & 3
## Installation
pip install pgl
## The Team
PGL is developed and maintained by NLP and Paddle Teams at Baidu
## License
PGL uses Apache License 2.0.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
sphinx==2.1.0
sphinx_rtd_theme
pgl
===
.. toctree::
:maxdepth: 1
pgl
pgl.data\_loader module: Some benchmark datasets.
=================================================
.. automodule:: pgl.data_loader
:members:
:undoc-members:
:show-inheritance:
pgl.graph module: Graph Storage
===============================
.. automodule:: pgl.graph
:members:
:undoc-members:
:show-inheritance:
pgl.graph\_wrapper module: Graph data holders for Paddle GNN.
==============================================================
.. automodule:: pgl.graph_wrapper
:members:
:undoc-members:
:show-inheritance:
pgl.layers: Predefined graph neural network layers.
====================================================
.. automodule:: pgl.layers
:members:
:undoc-members:
:show-inheritance:
API Reference
=============
.. toctree::
pgl.graph
pgl.graph_wrapper
pgl.layers
pgl.data_loader
pgl.utils.paddle_helper
pgl.utils.paddle\_helper module: Some helper functions for Paddle.
===================================================================
.. automodule:: pgl.utils.paddle_helper
:members:
:undoc-members:
:show-inheritance:
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# http://www.sphinx-doc.org/en/master/config
# -- Path setup --------------------------------------------------------------
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
"""
conf.py
"""
import os
import sys
sys.path.append(os.path.abspath('../../pgl/'))
sys.path.append(os.path.abspath('..'))
import sphinx_rtd_theme
# -- Project information -----------------------------------------------------
project = 'pgl'
copyright = '2019, PaddlePaddle'
author = 'PaddlePaddle'
# The full version, including alpha/beta/rc tags
release = '0.1.0.beta'
# -- General configuration ---------------------------------------------------
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
'sphinx.ext.todo', 'sphinx.ext.viewcode', 'sphinx.ext.mathjax',
'sphinx.ext.autodoc', 'sphinx.ext.napoleon', "markdown2rst"
]
# Support Inline mathjax
m2r_disable_inline_math = False
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
# This pattern also affects html_static_path and html_extra_path.
source_suffix = ['.rst', '.md']
exclude_patterns = ['pgl.graph_kernel', 'pgl.layers.conv']
language = "zh_CN"
# -- Options for HTML output -------------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_show_sourcelink = False
#html_logo = 'pgl_logo.png'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
'''
html_theme_options = {
'canonical_url': '',
'analytics_id': 'UA-XXXXXXX-1', # Provided by Google in your dashboard
'logo_only': False,
'display_version': True,
'prev_next_buttons_location': 'bottom',
'style_external_links': False,
'vcs_pageview_mode': '',
'style_nav_header_background': 'white',
# Toc options
'collapse_navigation': True,
'sticky_navigation': True,
'navigation_depth': 4,
'includehidden': True,
'titles_only': False
}
'''
.. mdinclude:: md/gat_examples.md
View the Code
=============
examples/gat/train.py
.. literalinclude:: ../../../examples/gat/train.py
:language: python
:linenos:
.. mdinclude:: md/gcn_examples.md
View the Code
=============
examples/gcn/train.py
.. literalinclude:: ../../../examples/gcn/train.py
:language: python
:linenos:
.. mdinclude:: md/graphsage_examples.md
View the Code
=============
examples/graphsage/train.py
.. literalinclude:: ../../../examples/graphsage/train.py
:language: python
:linenos:
examples/graphsage/reader.py
.. literalinclude:: ../../../examples/graphsage/reader.py
:language: python
:linenos:
examples/graphsage/model.py
.. literalinclude:: ../../../examples/graphsage/model.py
:language: python
:linenos:
# Building Graph Attention Networks
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data and leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
### Simple example to build single head GAT
To build a GAT layer, one can use our pre-defined ```pgl.layers.gat``` or just write a GAT layer with the message passing interface.
```python
import paddle.fluid as fluid
def gat_layer(graph_wrapper, node_feature, hidden_size):
def send_func(src_feat, dst_feat, edge_feat):
logits = src_feat["a1"] + dst_feat["a2"]
logits = fluid.layers.leaky_relu(logits, alpha=0.2)
return {"logits": logits, "h": src_feat }
def recv_func(msg):
norm = fluid.layers.sequence_softmax(msg["logits"])
output = msg["h"] * norm
return output
h = fluid.layers.fc(node_feature, hidden_size, bias_attr=False, name="hidden")
a1 = fluid.layers.fc(node_feature, 1, name="a1_weight")
a2 = fluid.layers.fc(node_feature, 1, name="a2_weight")
message = graph_wrapper.send(send_func,
nfeat_list=[("h", h), ("a1", a1), ("a2", a2)])
output = graph_wrapper.recv(recv_func, message)
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275s | 0.0253s |
### How to run
For example, use GPU to train GAT on the cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
### View the Code
See the code [here](gat_examples_code.html)
# Building Graph Convolutional Network
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks.
### Simple example to build GCN
To build a GCN layer, one can use our pre-defined ```pgl.layers.gcn``` or just write a GCN layer with the message passing interface.
```python
import paddle.fluid as fluid
def gcn_layer(graph_wrapper, node_feature, hidden_size, act):
def send_func(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def recv_func(msg):
return fluid.layers.sequence_pool(msg, "sum")
message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)])
output = graph_wrapper.recv(recv_func, message)
output = fluid.layers.fc(output, size=hidden_size, act=act)
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
### How to run
For example, use GPU to train GCN on the cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
### View the Code
See the code [here](gcn_examples_code.html)
# GraphSage for Large Scale Networks
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce GraphSAGE algorithm and reach the same level of indicators as the paper in Reddit Dataset. Besides, this is an example of subgraph sampling and training in PGL.
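Conceptually, one GraphSAGE-mean step lets every node average the features of its sampled neighbors and combine the result with its own feature. The sketch below is a condensed version of the ```graphsage_mean``` aggregator defined in ```examples/graphsage/model.py```, which ships with this example:

```python
import paddle.fluid as fluid

def copy_send(src_feat, dst_feat, edge_feat):
    # every sampled neighbor sends its current feature along the edge
    return src_feat["h"]

def mean_recv(feat):
    # average the variable-length bag of neighbor messages for each node
    return fluid.layers.sequence_pool(feat, pool_type="average")

def graphsage_mean_layer(gw, feature, hidden_size, name):
    msg = gw.send(copy_send, nfeat_list=[("h", feature)])
    neigh_feature = gw.recv(msg, mean_recv)
    self_feature = fluid.layers.fc(feature, hidden_size, act="relu",
                                   name=name + "_l")
    neigh_feature = fluid.layers.fc(neigh_feature, hidden_size, act="relu",
                                    name=name + "_r")
    output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
    return fluid.layers.l2_normalize(output, axis=1)
```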
### Datasets
The reddit dataset should be downloaded from the following links and placed in directory ```./data```. The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
- reddit.npz https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### How to run
To train a GraphSAGE model on Reddit Dataset, you can just run
```
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
#### Hyperparameters
- epoch: Number of training epochs. (default: 10)
- use_cuda: Use GPU if --use_cuda is set.
- graphsage_type: The aggregator type; we support "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- normalize: Normalize the input features if --normalize is set.
- sample_workers: The number of workers for multiprocess subgraph sampling.
- lr: Learning rate.
- symmetry: Make the edges symmetric if --symmetry is set.
- batch_size: Batch size.
- samples_1: The max number of neighbors sampled in the first hop. (default: 25)
- samples_2: The max number of neighbors sampled in the second hop. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Aggregator | Accuracy | Reported in paper |
| --- | --- | --- |
| Mean | 95.70% | 95.0% |
| Meanpool | 95.60% | 94.8% |
| Maxpool | 94.95% | 94.8% |
| LSTM | 95.13% | 95.4% |
### View the Code
See the code [here](graphsage_examples_code.html).
# Graph Representation Learning: Node2vec
[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce node2vec algorithms and reach the same level of indicators as the paper.
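The walks themselves are second-order biased random walks controlled by a return parameter p and an in-out parameter q (see the node2vec paper). Below is a minimal standalone sketch of a single walk step, independent of the PGL implementation in ```node2vec.py``` (Python 3, with a plain dict as a toy stand-in for the graph):

```python
import random

def node2vec_step(adj, prev, cur, p=1.0, q=1.0):
    """Pick the next node of a walk standing at `cur`, having arrived from `prev`.

    `adj` maps each node to the set of its neighbors.
    """
    neighbors = list(adj[cur])
    weights = []
    for nxt in neighbors:
        if nxt == prev:            # step back to where we came from: weight 1/p
            weights.append(1.0 / p)
        elif nxt in adj[prev]:     # stay close to prev (distance 1): weight 1
            weights.append(1.0)
        else:                      # move outward (distance 2): weight 1/q
            weights.append(1.0 / q)
    return random.choices(neighbors, weights=weights, k=1)[0]
```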
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- pgl
## How to run
For example, use GPU to train node2vec on the BlogCatalog and ArXiv datasets.
```sh
# multiclass task example
python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_BlogCatalog/ --offline_learning --epoch 400
python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000
# link prediction task example
python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 10
python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400
```
## Hyperparameters
- dataset: The dataset to use: "BlogCatalog" or "ArXiv".
- use_cuda: Use GPU if --use_cuda is set.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|deepwalk|multi-label classification|MacroF1|0.250|0.211
BlogCatalog|node2vec|multi-label classification|MacroF1|0.262|0.258
ArXiv|deepwalk|link prediction|AUC|0.9538|0.9340
ArXiv|node2vec|link prediction|AUC|0.9541|0.9366
## View the Code
See the code [here](node2vec_examples_code.html)
# StaticGraphWrapper for GAT Speed Optimization
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data and leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
However, different from the reproduction in **examples/gat**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which gives a better training speed, as sketched below.
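A hedged sketch of the idea follows; the exact constructor arguments here are assumptions, and the authoritative usage is in ```examples/static_gat/train.py```. Instead of converting the graph with ```gw.to_feed(...)``` and feeding it on every step, the graph is bound once and uploaded to device memory.

```python
import pgl
import paddle.fluid as fluid
from pgl import data_loader

place = fluid.CUDAPlace(0)
dataset = data_loader.CoraDataset()

# Assumed usage: the graph is captured at construction time, so no graph
# tensors need to be passed through the feed dict on every training step.
gw = pgl.graph_wrapper.StaticGraphWrapper(
    name="graph", graph=dataset.graph, place=place)

# ... build the GAT program on top of `gw` exactly as in examples/gat ...

gw.initialize(place)  # one-off upload of the preloaded graph into device memory
```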
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
### How to run
For example, use GPU to train GAT on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
### View the Code
See the code [here](static_gat_examples_code.html)
# StaticGraphWrapper for GCN Speed Optimization
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks.
However, different from the reproduction in **examples/gcn**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which gives a better training speed.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
### How to run
For example, use GPU to train GCN on the cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
### View the Code
See the code [here](static_gcn_examples_code.html)
.. mdinclude:: md/node2vec_examples.md
View the Code
=============
examples/node2vec/node2vec.py
.. literalinclude:: ../../../examples/node2vec/node2vec.py
:language: python
:linenos:
examples/node2vec/multi_class.py
.. literalinclude:: ../../../examples/node2vec/multi_class.py
:language: python
:linenos:
examples/node2vec/link_predict.py
.. literalinclude:: ../../../examples/node2vec/link_predict.py
:language: python
:linenos:
.. mdinclude:: md/static_gat_examples.md
View the Code
=============
examples/static_gat/train.py
.. literalinclude:: ../../../examples/static_gat/train.py
:language: python
:linenos:
.. mdinclude:: md/static_gcn_examples.md
View the Code
=============
examples/static_gcn/train.py
.. literalinclude:: ../../../examples/static_gcn/train.py
:language: python
:linenos:
Using StaticGraphWrapper for Speed Optimization
===============================================
.. toctree::
:maxdepth: 1
static_gcn_examples.rst
static_gat_examples.rst
:github_url: https://github.com/PaddlePaddle/PGL
.. toctree::
:maxdepth: 1
:caption: Introduction
:hidden:
introduction.rst
.. mdinclude:: md/introduction.md
Quick Start
===========
.. toctree::
:maxdepth: 1
:caption: Quick Start
:hidden:
instruction.rst
See instruction_ for quick start.
.. _instruction: instruction.html
.. toctree::
:maxdepth: 1
:caption: Examples
examples/gcn_examples.rst
examples/gat_examples.rst
examples/static_graph_wrapper.rst
examples/node2vec_examples.rst
examples/graphsage_examples.rst
.. toctree::
:maxdepth: 2
:caption: API Reference
api/pgl
The Team
========
.. toctree::
:maxdepth: 1
:caption: The Team
:hidden:
team.rst
PGL is developed and maintained by NLP and Paddle Teams at Baidu
License
=======
PGL uses Apache License 2.0.
Quick Start Instructions
========================
Install PGL
-----------
To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
paddlepaddle >= 1.4 (Faster performance on 1.5)
networkx
cython
We can simply install pgl by pip.
.. code-block:: sh
pip install pgl
.. mdinclude:: md/quick_start.md
.. mdinclude:: md/introduction.md
# Paddle Graph Learning (PGL)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<div />
<div align=center><img src="_static/framework_of_pgl.png" width="700"></div>
<center>The Framework of Paddle Graph Learning (PGL)</center>
<div />
We provide Python interfaces for storing/reading/querying graph-structured data, as well as two fundamental computational interfaces, the walk-based paradigm and the message-passing paradigm shown in the framework above, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, giving the framework a wide range of graph-based applications.
## Highlight: Efficient and Flexible <br/> Message Passing Paradigm
<div />
<div align=center><img src="_static/message_passing_paradigm.png" width="700"></div>
<center>The basic idea of message passing paradigm</center>
<div />
As shown on the left of the following figure, to support general user-defined message aggregation functions, DGL uses a degree-bucketing method that groups nodes with the same degree into a batch and then applies the aggregation function $\oplus$ to each batch serially. For PGL's user-defined aggregation functions, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<div/>
<div align=center><img src="_static/parallel_degree_bucketing.png" width="750"></div>
<center>The parallel degree bucketing of PGL</center>
<div/>
Users only need to call the ```sequence_ops``` functions provided by Paddle to implement efficient message aggregation. For example, ```sequence_pool``` can be used to sum the messages from neighbors.
```python
import paddle.fluid as fluid
def recv(msg):
return fluid.layers.sequence_pool(msg, "sum")
```
DGL does provide kernel-fusion (scatter-gather) optimizations for common aggregations such as sum and max. However, for **complex user-defined functions** the degree-bucketing algorithm executes each degree bucket serially, which cannot take full advantage of the GPU. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which fully utilizes GPU parallelism. Even without scatter-gather optimization, PGL still delivers excellent performance. Of course, we also provide built-in scatter-optimized message aggregation functions.
## Performance
We test all the GNN algorithms on a Tesla V100-SXM2-16G, running for 200 epochs to get the average speed, and we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
| Pubmed | GCN |79.2% |**0.0049s** |0.0051s |
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
## Step 1: using PGL to create a graph
Suppose we have a graph with 10 nodes and 14 edges as shown in the following figure:
![A simple graph](_static/quick_start_graph.png)
Our purpose is to train a graph neural network to classify yellow and green nodes. So we can create this graph as follows:
```python
import pgl
from pgl import graph # import pgl module
import numpy as np
def build_graph():
    # define the number of nodes; each node is represented by an integer id
num_node = 10
# add edges, we represent all edges as a list of tuple (src, dst)
edge_list = [(2, 0), (2, 1), (3, 1),(4, 0), (5, 0),
(6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
(7, 2), (7, 3), (8, 0), (9, 7)]
    # Each node can be represented by a d-dimensional feature vector; here, for simplicity, the feature vectors are randomly generated.
d = 16
feature = np.random.randn(num_node, d).astype("float32")
# each edge also can be represented by a feature vector
edge_feature = np.random.randn(len(edge_list), d).astype("float32")
# create a graph
g = graph.Graph(num_nodes = num_node,
edges = edge_list,
node_feat = {'feature':feature},
edge_feat ={'edge_feature': edge_feature})
return g
# create a graph object for saving graph data
g = build_graph()
```
After creating a graph in PGL, we can print out some information in the graph.
```python
print('There are %d nodes in the graph.'%g.num_nodes)
print('There are %d edges in the graph.'%g.num_edges)
# Out:
# There are 10 nodes in the graph.
# There are 14 edges in the graph.
```
Currently, PGL is developed on top of the static computation mode of Paddle (we will support the dynamic computation mode later), so models need to be built upon a virtual data holder. GraphWrapper provides a virtual graph structure on which users can build deep learning models; real graph data is then fed in to run the models.
```python
import paddle.fluid as fluid
use_cuda = False
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
# use GraphWrapper as a container for graph data to construct a graph neural network
gw = pgl.graph_wrapper.GraphWrapper(name='graph',
place = place,
node_feat=g.node_feat_info())
```
## Step 2: create a simple Graph Convolutional Network (GCN)
In this tutorial, we use a simple Graph Convolutional Network (GCN) developed by [Kipf and Welling](https://arxiv.org/abs/1609.02907) to perform node classification. Here we use the simplest GCN structure. If readers want to know more about GCN, please refer to the original paper.
* In layer $l$, each node $u_i^l$ has a feature vector $h_i^l$;
* In every layer, the idea of GCN is that the feature vector $h_i^{l+1}$ of each node $u_i^{l+1}$ in the next layer is obtained by weighting the feature vectors of all its neighboring nodes and then passing the result through a non-linear transformation, as the formula below shows.
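A common way to write this update (following Kipf and Welling, with $\mathcal{N}(i)$ the neighbors of node $i$, $c_{ij}$ a normalization constant such as $\sqrt{d_i d_j}$, $W^{l}$ the layer's weight matrix and $\sigma$ a non-linear activation) is

$$h_i^{l+1} = \sigma\Big(\sum_{j \in \mathcal{N}(i)} \frac{1}{c_{ij}} W^{l} h_j^{l}\Big)$$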
In PGL, we can easily implement a GCN layer as follows:
```python
# define GCN layer function
def gcn_layer(gw, feature, hidden_size, name, activation):
    # gw is a GraphWrapper; feature holds the node feature vectors
# define message function
def send_func(src_feat, dst_feat, edge_feat):
# In this tutorial, we return the feature vector of the source node as message
return src_feat['h']
# define reduce function
def recv_func(feat):
# we sum the feature vector of the source node
return fluid.layers.sequence_pool(feat, pool_type='sum')
    # trigger message passing
    msg = gw.send(send_func, nfeat_list=[('h', feature)])
    # the recv function receives the messages and triggers the reduce function to handle them
output = gw.recv(msg, recv_func)
output = fluid.layers.fc(output,
size=hidden_size,
bias_attr=False,
act=activation,
name=name)
return output
```
After defining the GCN layer, we can construct a deeper GCN model with two GCN layers.
```python
output = gcn_layer(gw, gw.node_feat['feature'],
hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=2,
name='gcn_layer_2', activation=None)
```
## Step 3: data preprocessing
Since we implement a node binary classifier, we can use 0 and 1 to represent two classes respectively.
```python
y = [0,1,1,1,0,0,0,1,0,1]
label = np.array(y, dtype="float32")
label = np.expand_dims(label, -1)
```
## Step 4: training program
The training process of GCN is the same as that of other Paddle-based models.
- First, we create a loss function.
- Then, we create an optimizer.
- Finally, we create an executor and train the model.
```python
# create a label layer as a container
node_label = fluid.layers.data("node_label", shape=[None, 1],
dtype="float32", append_batch_size=False)
# using cross-entropy with sigmoid layer as the loss function
loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=output, label=node_label)
# calculate the mean loss
loss = fluid.layers.mean(loss)
# choose the Adam optimizer and set the learning rate to be 0.01
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(loss)
# create the executor
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feed_dict = gw.to_feed(g) # gets graph data
for epoch in range(30):
feed_dict['node_label'] = label
train_loss = exe.run(fluid.default_main_program(),
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
print('Epoch %d | Loss: %f'%(epoch, train_loss[0]))
```
The Team
========
PGL is developed and maintained by NLP and Paddle Teams at Baidu
# PGL Examples for GAT
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data and leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
### Simple example to build single head GAT
To build a GAT layer, one can use our pre-defined ```pgl.layers.gat``` or just write a GAT layer with the message passing interface.
```python
import paddle.fluid as fluid
def gat_layer(graph_wrapper, node_feature, hidden_size):
def send_func(src_feat, dst_feat, edge_feat):
logits = src_feat["a1"] + dst_feat["a2"]
logits = fluid.layers.leaky_relu(logits, alpha=0.2)
return {"logits": logits, "h": src_feat }
def recv_func(msg):
norm = fluid.layers.sequence_softmax(msg["logits"])
output = msg["h"] * norm
return output
h = fluid.layers.fc(node_feature, hidden_size, bias_attr=False, name="hidden")
a1 = fluid.layers.fc(node_feature, 1, name="a1_weight")
a2 = fluid.layers.fc(node_feature, 1, name="a2_weight")
message = graph_wrapper.send(send_func,
nfeat_list=[("h", h), ("a1", a1), ("a2", a2)])
output = graph_wrapper.recv(recv_func, message)
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275s | 0.0253s |
### How to run
For example, use GPU to train GAT on the cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#-*- coding: utf-8 -*-
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def main(args):
dataset = load(args.dataset)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 8
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gat(gw,
gw.node_feat["words"],
hidden_size,
activation="elu",
name="gat_layer_1",
num_heads=8,
feat_drop=0.6,
attn_drop=0.6,
is_test=False)
output = pgl.layers.gat(gw,
output,
dataset.num_classes,
num_heads=1,
activation=None,
name="gat_layer_2",
feat_drop=0.6,
attn_drop=0.6,
is_test=False)
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(
learning_rate=0.005,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GAT')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for GCN
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks.
### Simple example to build GCN
To build a GCN layer, one can use our pre-defined ```pgl.layers.gcn``` or just write a GCN layer with the message passing interface.
```python
import paddle.fluid as fluid
def gcn_layer(graph_wrapper, node_feature, hidden_size, act):
def send_func(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def recv_func(msg):
return fluid.layers.sequence_pool(msg, "sum")
message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)])
output = graph_wrapper.recv(recv_func, message)
output = fluid.layers.fc(output, size=hidden_size, act=act)
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
### How to run
For example, use GPU to train GCN on the cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset to use: "cora", "citeseer" or "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def main(args):
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 16
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output = fluid.layers.dropout(
output, 0.5, dropout_implementation='upscale_in_train')
output = pgl.layers.gcn(gw,
output,
dataset.num_classes,
activation=None,
norm=gw.node_feat['norm'],
name="gcn_layer_2")
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(
learning_rate=1e-2,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce GraphSAGE algorithm and reach the same level of indicators as the paper in Reddit Dataset. Besides, this is an example of subgraph sampling and training in PGL.
### Datasets
The reddit dataset should be downloaded from the following links and placed in directory ```./data```. The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
- reddit.npz https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### How to run
To train a GraphSAGE model on Reddit Dataset, you can just run
```
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
#### Hyperparameters
- epoch: Number of training epochs. (default: 10)
- use_cuda: Use GPU if --use_cuda is set.
- graphsage_type: The aggregator type; we support "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- normalize: Normalize the input features if --normalize is set.
- sample_workers: The number of workers for multiprocess subgraph sampling.
- lr: Learning rate.
- symmetry: Make the edges symmetric if --symmetry is set.
- batch_size: Batch size.
- samples_1: The max number of neighbors sampled in the first hop. (default: 25)
- samples_2: The max number of neighbors sampled in the second hop. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Aggregator | Accuracy | Reported in paper |
| --- | --- | --- |
| Mean | 95.70% | 95.0% |
| Meanpool | 95.60% | 94.8% |
| Maxpool | 94.95% | 94.8% |
| LSTM | 95.13% | 95.4% |
**Download the dataset**
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def mean_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils.logger import log
import train
import time
def node_batch_iter(nodes, node_label, batch_size):
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph, samples):
def work():
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
feed_dict = {}
feed_dict["nodes"] = [int(n) for n in nodes]
feed_dict["eids"] = [int(e) for e in eids]
feed_dict["node_label"] = [int(n) for n in batch_train_labels]
feed_dict["node_index"] = [int(n) for n in batch_train_samples]
yield feed_dict
return work
def multiprocess_graph_reader(graph,
graph_wrapper,
samples,
node_index,
batch_size,
node_label,
num_workers=4):
def parse_to_subgraph(rd):
def work():
for data in rd():
nodes = data["nodes"]
eids = data["eids"]
batch_train_labels = data["node_label"]
batch_train_samples = data["node_index"]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
def reader():
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)], graph,
samples))
multi_process_sample = paddle.reader.multiprocess_reader(
reader_pool, use_pipe=False)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
return reader()
def graph_reader(graph, graph_wrapper, samples, node_index, batch_size,
node_label):
def reader():
for batch_train_samples, batch_train_labels in node_batch_iter(
node_index, node_label, batch_size=batch_size):
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(nodes=nodes, eid=eids)
feed_dict = graph_wrapper.to_feed(subgraph)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = np.array(sub_node_index, dtype="int32")
yield feed_dict
return paddle.reader.buffered(reader, 1000)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data(normalize=True, symmetry=True):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row
dst = adj.col
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"index": np.arange(
0, len(feature), dtype="int32")})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"feature": feature,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int32", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
feature = fluid.layers.gather(feature, graph_wrapper.node_feat['index'])
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s % i")
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s % i")
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data(args.normalize, args.symmetry)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
feature, feature_init = paddle_helper.constant(
"feat",
dtype=data['feature'].dtype,
value=data['feature'],
hide_batch_size=False)
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", place, node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
feature=feature,
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
feature_init(place)
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for node2vec
[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the node2vec algorithm and reach results on par with those reported in the paper.
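At the core of node2vec is a second-order biased random walk: when stepping from the current node, a candidate successor is weighted 1/p if it means returning to the previous node, 1 if it is also a neighbor of the previous node, and 1/q otherwise. The snippet below is a minimal plain-Python sketch of that sampling step (it mirrors the Cython kernel `node2vec_sample` shipped in this repository; the function name and argument layout here are illustrative only).
```python
import random

def node2vec_next_step(succ, prev_succ, prev_node, p, q):
    """Sample the next node from `succ`, the neighbors of the current node,
    biased by the return parameter p and the in-out parameter q."""
    prev_succ_set = set(prev_succ)
    weights = []
    for s in succ:
        if s == prev_node:            # step back to where we came from
            weights.append(1.0 / p)
        elif s in prev_succ_set:      # stay close: also a neighbor of prev_node
            weights.append(1.0)
        else:                         # move outward, away from prev_node
            weights.append(1.0 / q)
    return random.choices(succ, weights=weights, k=1)[0]
```
A small p makes the walk more likely to backtrack and stay local, while a small q pushes it outward to explore farther nodes.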
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- pgl
## How to run
For example, use GPU to train node2vec on the BlogCatalog dataset.
```sh
# multiclass task example
python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_BlogCatalog/ --offline_learning --epoch 400
python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000
# link prediction task example
python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 10
python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400
```
## Hyperparameters
- dataset: The dataset to use, "BlogCatalog" or "ArXiv".
- use_cuda: Train on GPU if this flag is set.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|deepwalk|multi-label classification|MacroF1|0.250|0.211
BlogCatalog|node2vec|multi-label classification|MacroF1|0.262|0.258
ArXiv|deepwalk|link prediction|AUC|0.9538|0.9340
ArXiv|node2vec|link prediction|AUC|0.9541|0.9366
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
from sklearn import metrics
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
elif name == "ArXiv":
dataset = data_loader.ArXivDataset()
else:
raise ValueError(name + " dataset doesn't exists")
return dataset
def binary_op(u_embed, v_embed, binary_op_type):
if binary_op_type == "Average":
edge_embed = (u_embed + v_embed) / 2
elif binary_op_type == "Hadamard":
edge_embed = u_embed * v_embed
elif binary_op_type == "Weighted-L1":
edge_embed = l.abs(u_embed - v_embed)
elif binary_op_type == "Weighted-L2":
edge_embed = (u_embed - v_embed) * (u_embed - v_embed)
else:
raise ValueError(binary_op_type + " binary_op_type doesn't exist")
return edge_embed
def link_predict_model(num_nodes,
hidden_size=16,
name='link_predict_task',
binary_op_type="Weighted-L2"):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, 1], [-1, 1]],
dtypes=['int64', 'int64', 'int64'],
lod_levels=[0, 0, 0],
name=name + '_pyreader',
use_double_buffer=True)
u, v, label = l.read_file(pyreader)
u_embed = l.embedding(
input=u,
size=[num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='content'))
v_embed = l.embedding(
input=v,
size=[num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='content'))
u_embed.stop_gradient = True
v_embed.stop_gradient = True
edge_embed = binary_op(u_embed, v_embed, binary_op_type)
logit = l.fc(input=edge_embed, size=1)
loss = l.sigmoid_cross_entropy_with_logits(logit, l.cast(label, 'float32'))
loss = l.reduce_mean(loss)
prob = l.sigmoid(logit)
return pyreader, loss, prob, label
def link_predict_generator(pos_edges,
neg_edges,
batch_size=512,
epoch=2000,
shuffle=True):
all_edges = []
for (u, v) in pos_edges:
all_edges.append([u, v, 1])
for (u, v) in neg_edges:
all_edges.append([u, v, 0])
all_edges = np.array(all_edges, np.int64)
def batch_edges_generator(shuffle=shuffle):
perm = np.arange(len(all_edges), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_edges):
yield all_edges[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_edges in batch_edges_generator():
yield batch_edges.T[0:1].T, batch_edges.T[
1:2].T, batch_edges.T[2:3].T
return wrapper
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
dataset = load(args.dataset)
num_edges = len(dataset.pos_edges) + len(dataset.neg_edges)
train_num_edges = int(len(dataset.pos_edges) * 0.5) + int(
len(dataset.neg_edges) * 0.5)
test_num_edges = num_edges - train_num_edges
train_steps = (train_num_edges // train_num_edges) * epoch  # equals epoch: the train reader below yields one full batch per epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels = link_predict_model(
dataset.graph.num_nodes, hidden_size=hidden_size, name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels = link_predict_model(
dataset.graph.num_nodes, hidden_size=hidden_size, name='test')
test_prog = test_prog.clone(for_test=True)
train_pyreader.decorate_tensor_provider(
link_predict_generator(
dataset.pos_edges[:train_num_edges // 2],
dataset.neg_edges[:train_num_edges // 2],
batch_size=train_num_edges,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
link_predict_generator(
dataset.pos_edges[train_num_edges // 2:],
dataset.neg_edges[train_num_edges // 2:],
batch_size=test_num_edges,
epoch=1))
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.start()
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
while 1:
try:
train_loss_val, train_probs_val, train_labels_val = exe.run(
train_prog,
fetch_list=[train_loss, train_probs, train_labels],
return_numpy=True)
fpr, tpr, thresholds = metrics.roc_curve(train_labels_val,
train_probs_val)
train_auc = metrics.auc(fpr, tpr)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train AUC: %f " % train_auc)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals = [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val = exe.run(
test_prog,
fetch_list=[test_loss, test_probs, test_labels],
return_numpy=True)
test_probs_vals.append(test_probs_val)
test_labels_vals.append(test_labels_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
fpr, tpr, thresholds = metrics.roc_curve(test_labels_array,
test_probs_array)
test_auc = metrics.auc(fpr, tpr)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test AUC: %f " % test_auc)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="ArXiv",
help="dataset (BlogCatalog, ArXiv)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/deepwalk_arxiv_e10/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def node_classify_model(graph,
num_labels,
hidden_size=16,
name='node_classify_task'):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='content'))
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
#labels = (np.random.rand(512, 39) > 0.95).astype(np.float32)
def batch_nodes_generator(shuffle=shuffle):
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
assert topk_list is not None or threshold is not None, "either topk_list or threshold must be provided"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
threshold = args.threshold
dataset = load(args.dataset)
if args.batch_size is None:
batch_size = len(dataset.train_index)
else:
batch_size = args.batch_size
train_steps = (len(dataset.train_index) // batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels, train_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels, test_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=batch_size,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph, dataset.test_index, batch_size=batch_size, epoch=1))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
train_pyreader.start()
while 1:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_loss, train_probs, train_labels, train_topk
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro", threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro", threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_loss, test_probs, test_labels, test_topk
],
return_numpy=True)
test_probs_vals.append(test_probs_val)
test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--threshold", type=float, default=0.3)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/baseline_node2vec/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import io
from multiprocessing import Pool
import glob
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
def load(name):
if name == "BlogCatalog":
dataset = data_loader.BlogCatalogDataset()
elif name == "ArXiv":
dataset = data_loader.ArXivDataset()
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def node2vec_model(graph, hidden_size=16, neg_num=5):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, 1, 1], [-1, neg_num, 1]],
dtypes=['int64', 'int64', 'int64'],
lod_levels=[0, 0, 0],
name='train',
use_double_buffer=True)
embed_init = fluid.initializer.UniformInitializer(low=-1.0, high=1.0)
weight_init = fluid.initializer.TruncatedNormal(scale=1.0 /
math.sqrt(hidden_size))
src, pos, negs = l.read_file(pyreader)
embed_src = l.embedding(
input=src,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(
name='content', initializer=embed_init))
weight_pos = l.embedding(
input=pos,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(
name='weight', initializer=weight_init))
weight_negs = l.embedding(
input=negs,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(
name='weight', initializer=weight_init))
pos_logits = l.matmul(
embed_src, weight_pos, transpose_y=True) # [batch_size, 1, 1]
neg_logits = l.matmul(
embed_src, weight_negs, transpose_y=True) # [batch_size, 1, neg_num]
ones_label = pos_logits * 0. + 1.
ones_label.stop_gradient = True
pos_loss = l.sigmoid_cross_entropy_with_logits(pos_logits, ones_label)
zeros_label = neg_logits * 0.
zeros_label.stop_gradient = True
neg_loss = l.sigmoid_cross_entropy_with_logits(neg_logits, zeros_label)
loss = (l.reduce_mean(pos_loss) + l.reduce_mean(neg_loss)) / 2
return pyreader, loss
def gen_pair(walks, left_win_size=2, right_win_size=2):
src = []
pos = []
for walk in walks:
for left_offset in range(1, left_win_size + 1):
src.extend(walk[left_offset:])
pos.extend(walk[:-left_offset])
for right_offset in range(1, right_win_size + 1):
src.extend(walk[:-right_offset])
pos.extend(walk[right_offset:])
src, pos = np.array(src, dtype=np.int64), np.array(pos, dtype=np.int64)
src, pos = np.expand_dims(src, -1), np.expand_dims(pos, -1)
src, pos = np.expand_dims(src, -1), np.expand_dims(pos, -1)
return src, pos
def node2vec_generator(graph,
batch_size=512,
walk_len=5,
p=0.25,
q=0.25,
win_size=2,
neg_num=5,
epoch=200,
filelist=None):
def walks_generator():
if filelist is not None:
bucket = []
for filename in filelist:
with io.open(filename) as inf:
for line in inf:
walk = [int(x) for x in line.strip('\n').split(' ')]
bucket.append(walk)
if len(bucket) == batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
else:
for _ in range(epoch):
for nodes in graph.node_batch_iter(batch_size):
walks = graph.node2vec_random_walk(nodes, walk_len, p, q)
yield walks
def wrapper():
for walks in walks_generator():
src, pos = gen_pair(walks, win_size, win_size)
if src.shape[0] == 0:
continue
negs = graph.sample_nodes([len(src), neg_num, 1]).astype(np.int64)
yield [src, pos, negs]
return wrapper
def process(args):
idx, graph, save_path, epoch, batch_size, walk_len, p, q, seed = args
with open('%s/%s' % (save_path, idx), 'w') as outf:
for _ in range(epoch):
np.random.seed(seed)
for nodes in graph.node_batch_iter(batch_size):
walks = graph.node2vec_random_walk(nodes, walk_len, p, q)
for walk in walks:
outf.write(' '.join([str(token) for token in walk]) + '\n')
def main(args):
hidden_size = args.hidden_size
neg_num = args.neg_num
epoch = args.epoch
p = args.p
q = args.q
save_path = args.save_path
batch_size = args.batch_size
walk_len = args.walk_len
win_size = args.win_size
if not os.path.isdir(save_path):
os.makedirs(save_path)
dataset = load(args.dataset)
if args.offline_learning:
log.info("Start random walk on disk...")
walk_save_path = os.path.join(save_path, "walks")
if not os.path.isdir(walk_save_path):
os.makedirs(walk_save_path)
pool = Pool(args.processes)
args_list = [(x, dataset.graph, walk_save_path, 1, batch_size,
walk_len, p, q, np.random.randint(2**32))
for x in range(epoch)]
pool.map(process, args_list)
filelist = glob.glob(os.path.join(walk_save_path, "*"))
log.info("Random walk on disk Done.")
else:
filelist = None
train_steps = int(dataset.graph.num_nodes / batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
node2vec_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(node2vec_prog, startup_prog):
with fluid.unique_name.guard():
node2vec_pyreader, node2vec_loss = node2vec_model(
dataset.graph, hidden_size=hidden_size, neg_num=neg_num)
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(node2vec_loss)
node2vec_pyreader.decorate_tensor_provider(
node2vec_generator(
dataset.graph,
batch_size=batch_size,
walk_len=walk_len,
win_size=win_size,
epoch=epoch,
neg_num=neg_num,
p=p,
q=q,
filelist=filelist))
node2vec_pyreader.start()
exe = fluid.Executor(place)
exe.run(startup_prog)
prev_time = time.time()
step = 0
while 1:
try:
node2vec_loss_val = exe.run(node2vec_prog,
fetch_list=[node2vec_loss],
return_numpy=True)[0]
cur_time = time.time()
use_time = cur_time - prev_time
prev_time = cur_time
step += 1
log.info("Step %d " % step + "Node2vec Loss: %f " %
node2vec_loss_val + " %f s/step." % use_time)
except fluid.core.EOFException:
node2vec_pyreader.reset()
break
fluid.io.save_persistables(exe,
os.path.join(save_path, "paddle_model"),
node2vec_prog)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog, ArXiv)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--offline_learning", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--neg_num", type=int, default=20)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=1024)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=10)
parser.add_argument("--p", type=float, default=0.25)
parser.add_argument("--q", type=float, default=0.25)
parser.add_argument("--save_path", type=str, default="./tmp/node2vec")
parser.add_argument("--processes", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for GAT with StaticGraphWrapper
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach results on par with those reported in the paper on citation network benchmarks.
However, different from the reproduction in **examples/gat**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which yields better training speed.
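A condensed sketch of this pattern, distilled from the training script that follows (not a complete training program; the layer sizes are illustrative):
```python
import paddle.fluid as fluid
import pgl
from pgl import data_loader

place = fluid.CUDAPlace(0)            # or fluid.CPUPlace()
dataset = data_loader.CoraDataset()

main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
    # The graph structure and node features become constants of the program.
    gw = pgl.graph_wrapper.StaticGraphWrapper(
        name="graph", graph=dataset.graph, place=place)
    output = pgl.layers.gat(gw, gw.node_feat["words"], 16,
                            activation="elu", name="gat_layer_1",
                            num_heads=8, feat_drop=0.6, attn_drop=0.6,
                            is_test=False)

exe = fluid.Executor(place)
exe.run(startup_program)
gw.initialize(place)                  # upload the graph to device memory once
# No per-step graph feeding is needed afterwards: feed={} is enough.
logits = exe.run(main_program, feed={}, fetch_list=[output])[0]
```
Compared with the regular `GraphWrapper`, this removes the per-batch cost of converting and feeding the graph, which is the main source of the speed-up reported below.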
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (speed is faster with paddle 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
### How to run
For example, use GPU to train GAT on the Cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Train on GPU if this flag is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pgl
from pgl import data_loader
from pgl.utils import paddle_helper
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exists")
return dataset
def main(args):
dataset = load(args.dataset)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 16
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.StaticGraphWrapper(
name="graph", graph=dataset.graph, place=place)
output = pgl.layers.gat(gw,
gw.node_feat["words"],
hidden_size,
activation="elu",
name="gat_layer_1",
num_heads=8,
feat_drop=0.6,
attn_drop=0.6,
is_test=False)
output = pgl.layers.gat(gw,
output,
dataset.num_classes,
num_heads=1,
activation=None,
name="gat_layer_2",
feat_drop=0.6,
attn_drop=0.6,
is_test=False)
val_program = train_program.clone(for_test=True)
test_program = train_program.clone(for_test=True)
initializer = []
with fluid.program_guard(train_program, startup_program):
train_node_index, init = paddle_helper.constant(
"train_node_index", dtype="int32", value=train_index)
initializer.append(init)
train_node_label, init = paddle_helper.constant(
"train_node_label", dtype="int64", value=train_label)
initializer.append(init)
pred = fluid.layers.gather(output, train_node_index)
train_loss_t = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=train_node_label)
train_loss_t = fluid.layers.reduce_mean(train_loss_t)
adam = fluid.optimizer.Adam(
learning_rate=1e-2,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(train_loss_t)
with fluid.program_guard(val_program, startup_program):
val_node_index, init = paddle_helper.constant(
"val_node_index", dtype="int32", value=val_index)
initializer.append(init)
val_node_label, init = paddle_helper.constant(
"val_node_label", dtype="int64", value=val_label)
initializer.append(init)
pred = fluid.layers.gather(output, val_node_index)
val_loss_t, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=val_node_label, return_softmax=True)
val_acc_t = fluid.layers.accuracy(
input=pred, label=val_node_label, k=1)
val_loss_t = fluid.layers.reduce_mean(val_loss_t)
with fluid.program_guard(test_program, startup_program):
test_node_index, init = paddle_helper.constant(
"test_node_index", dtype="int32", value=test_index)
initializer.append(init)
test_node_label, init = paddle_helper.constant(
"test_node_label", dtype="int64", value=test_label)
initializer.append(init)
pred = fluid.layers.gather(output, test_node_index)
test_loss_t, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=test_node_label, return_softmax=True)
test_acc_t = fluid.layers.accuracy(
input=pred, label=test_node_label, k=1)
test_loss_t = fluid.layers.reduce_mean(test_loss_t)
exe = fluid.Executor(place)
exe.run(startup_program)
gw.initialize(place)
for init in initializer:
init(place)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed={},
fetch_list=[train_loss_t],
return_numpy=True)
train_loss = train_loss[0]
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
val_loss, val_acc = exe.run(val_program,
feed={},
fetch_list=[val_loss_t, val_acc_t],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(
dur) + "Train Loss: %f " % train_loss + "Val Loss: %f " % val_loss
+ "Val Acc: %f " % val_acc)
test_loss, test_acc = exe.run(test_program,
feed={},
fetch_list=[test_loss_t, test_acc_t],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GAT')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for GCN with StaticGraphWrapper
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce the GCN algorithm and reach results on par with those reported in the paper on citation network benchmarks.
However, different from the reproduction in **examples/gcn**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which yields better training speed.
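Besides the static wrapper, the GCN script precomputes the symmetric normalizer D^{-1/2} once and attaches it as a node feature, so the `pgl.layers.gcn` layer only has to multiply it in. A minimal sketch of that preprocessing, taken from the training script below:
```python
import numpy as np
from pgl import data_loader

dataset = data_loader.CoraDataset()

# Precompute D^{-1/2} per node (zero-degree nodes keep a norm of 0).
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)   # shape [num_nodes, 1]

# Later, inside the program, the layer consumes it as:
#   pgl.layers.gcn(gw, gw.node_feat["words"], hidden_size,
#                  activation="relu", norm=gw.node_feat["norm"], name="gcn_layer_1")
```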
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (speed is faster with paddle 1.5)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
### How to run
For example, use GPU to train GCN on the Cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Train on GPU if this flag is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pgl
from pgl import data_loader
from pgl.utils import paddle_helper
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exists")
return dataset
def main(args):
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 16
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.StaticGraphWrapper(
name="graph", graph=dataset.graph, place=place)
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output = fluid.layers.dropout(
output, 0.5, dropout_implementation='upscale_in_train')
output = pgl.layers.gcn(gw,
output,
dataset.num_classes,
activation=None,
norm=gw.node_feat['norm'],
name="gcn_layer_2")
val_program = train_program.clone(for_test=True)
test_program = train_program.clone(for_test=True)
initializer = []
with fluid.program_guard(train_program, startup_program):
train_node_index, init = paddle_helper.constant(
"train_node_index", dtype="int32", value=train_index)
initializer.append(init)
train_node_label, init = paddle_helper.constant(
"train_node_label", dtype="int64", value=train_label)
initializer.append(init)
pred = fluid.layers.gather(output, train_node_index)
train_loss_t = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=train_node_label)
train_loss_t = fluid.layers.reduce_mean(train_loss_t)
adam = fluid.optimizer.Adam(
learning_rate=1e-2,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(train_loss_t)
with fluid.program_guard(val_program, startup_program):
val_node_index, init = paddle_helper.constant(
"val_node_index", dtype="int32", value=val_index)
initializer.append(init)
val_node_label, init = paddle_helper.constant(
"val_node_label", dtype="int64", value=val_label)
initializer.append(init)
pred = fluid.layers.gather(output, val_node_index)
val_loss_t, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=val_node_label, return_softmax=True)
val_acc_t = fluid.layers.accuracy(
input=pred, label=val_node_label, k=1)
val_loss_t = fluid.layers.reduce_mean(val_loss_t)
with fluid.program_guard(test_program, startup_program):
test_node_index, init = paddle_helper.constant(
"test_node_index", dtype="int32", value=test_index)
initializer.append(init)
test_node_label, init = paddle_helper.constant(
"test_node_label", dtype="int64", value=test_label)
initializer.append(init)
pred = fluid.layers.gather(output, test_node_index)
test_loss_t, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=test_node_label, return_softmax=True)
test_acc_t = fluid.layers.accuracy(
input=pred, label=test_node_label, k=1)
test_loss_t = fluid.layers.reduce_mean(test_loss_t)
exe = fluid.Executor(place)
exe.run(startup_program)
gw.initialize(place)
for init in initializer:
init(place)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed={},
fetch_list=[train_loss_t],
return_numpy=True)
train_loss = train_loss[0]
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
val_loss, val_acc = exe.run(val_program,
feed={},
fetch_list=[val_loss_t, val_acc_t],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(
dur) + "Train Loss: %f " % train_loss + "Val Loss: %f " % val_loss
+ "Val Acc: %f " % val_acc)
test_loss, test_acc = exe.run(test_program,
feed={},
fetch_list=[test_loss_t, test_acc_t],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
graph_kernel.cpp
graph_kernel.*.so
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Generate pgl apis
"""
__version__ = "0.1.0.beta"
from pgl import layers
from pgl import graph_wrapper
from pgl import graph
from pgl import data_loader
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
This directory contains a selection of the Cora dataset (www.research.whizbang.com/data).
The Cora dataset consists of Machine Learning papers. These papers are classified into one of the following seven classes:
Case_Based
Genetic_Algorithms
Neural_Networks
Probabilistic_Methods
Reinforcement_Learning
Rule_Learning
Theory
The papers were selected in a way such that in the final corpus every paper cites or is cited by at least one other paper. There are 2708 papers in the whole corpus.
After stemming and removing stopwords we were left with a vocabulary of size 1433 unique words. All words with document frequency less than 10 were removed.
THE DIRECTORY CONTAINS TWO FILES:
The .content file contains descriptions of the papers in the following format:
<paper_id> <word_attributes>+ <class_label>
The first entry in each line contains the unique string ID of the paper followed by binary values indicating whether each word in the vocabulary is present (indicated by 1) or absent (indicated by 0) in the paper. Finally, the last entry in the line contains the class label of the paper.
The .cites file contains the citation graph of the corpus. Each line describes a link in the following format:
<ID of cited paper> <ID of citing paper>
Each line contains two paper IDs. The first entry is the ID of the paper being cited and the second ID stands for the paper which contains the citation. The direction of the link is from right to left. If a line is represented by "paper1 paper2" then the link is "paper2->paper1".
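As a concrete illustration of the two formats above, a minimal parsing sketch (the file paths and variable names are illustrative; PGL's own loader is pgl.data_loader.CoraDataset):
```python
# cora.content: <paper_id> <word_attributes>+ <class_label>
features, labels = {}, {}
with open("cora.content") as f:
    for line in f:
        fields = line.strip().split()
        paper_id, class_label = fields[0], fields[-1]
        features[paper_id] = [int(x) for x in fields[1:-1]]   # 1433 binary word indicators
        labels[paper_id] = class_label

# cora.cites: <ID of cited paper> <ID of citing paper>, i.e. edge citing -> cited
edges = []
with open("cora.cites") as f:
    for line in f:
        cited, citing = line.split()
        edges.append((citing, cited))
```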
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This package implements some benchmark datasets for graph network
and node representation learning.
"""
import os
import io
import sys
import numpy as np
import pickle as pkl
import networkx as nx
from pgl import graph
from pgl.utils.logger import log
__all__ = [
"CitationDataset",
"CoraDataset",
"ArXivDataset",
"BlogCatalogDataset",
]
def get_default_data_dir(name):
"""Get data path name"""
dir_path = os.path.abspath(os.path.dirname(__file__))
dir_path = os.path.join(dir_path, 'data')
filepath = os.path.join(dir_path, name)
return filepath
def _pickle_load(pkl_file):
"""Load pickle"""
if sys.version_info > (3, 0):
return pkl.load(pkl_file, encoding='latin1')
else:
return pkl.load(pkl_file)
def _parse_index_file(filename):
"""Parse index file."""
index = []
for line in open(filename):
index.append(int(line.strip()))
return index
class CitationDataset(object):
"""Citation dataset helps to create data for citation dataset (Pubmed and Citeseer)
Args:
name: The name for the dataset ("pubmed" or "citeseer")
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to contain self loop edges.
Attributes:
graph: The :code:`Graph` data object
y: Labels for each node
num_classes: Number of classes.
train_index: The index for nodes in training set.
val_index: The index for nodes in validation set.
test_index: The index for nodes in test set.
"""
def __init__(self, name, symmetry_edges=True, self_loop=True):
self.path = get_default_data_dir(name)
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self.name = name
self._load_data()
def _load_data(self):
"""Load data
"""
objnames = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
objects = []
for i in range(len(objnames)):
with open("{}/ind.{}.{}".format(self.path, self.name, objnames[i]),
'rb') as f:
objects.append(_pickle_load(f))
x, y, tx, ty, allx, ally, _graph = tuple(objects)
test_idx_reorder = _parse_index_file("{}/ind.{}.test.index".format(
self.path, self.name))
test_idx_range = np.sort(test_idx_reorder)
allx = allx.todense()
tx = tx.todense()
if self.name == 'citeseer':
# Fix citeseer dataset (there are some isolated nodes in the graph)
# Find isolated nodes, add them as zero-vecs into the right position
test_idx_range_full = range(
min(test_idx_reorder), max(test_idx_reorder) + 1)
tx_extended = np.zeros(
(len(test_idx_range_full), x.shape[1]), dtype="float32")
tx_extended[test_idx_range - min(test_idx_range), :] = tx
tx = tx_extended
ty_extended = np.zeros(
(len(test_idx_range_full), y.shape[1]), dtype="float32")
ty_extended[test_idx_range - min(test_idx_range), :] = ty
ty = ty_extended
features = np.vstack([allx, tx])
features[test_idx_reorder, :] = features[test_idx_range, :]
features = features / (np.sum(features, axis=-1) + 1e-15)
features = np.array(features, dtype="float32")
_graph = nx.DiGraph(nx.from_dict_of_lists(_graph))
onehot_labels = np.vstack((ally, ty))
onehot_labels[test_idx_reorder, :] = onehot_labels[test_idx_range, :]
labels = np.argmax(onehot_labels, 1)
idx_test = test_idx_range.tolist()
idx_train = range(len(y))
idx_val = range(len(y), len(y) + 500)
all_edges = []
for i in _graph.edges():
u, v = tuple(i)
all_edges.append((u, v))
if self.symmetry_edges:
all_edges.append((v, u))
if self.self_loop:
for i in range(_graph.number_of_nodes()):
all_edges.append((i, i))
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=_graph.number_of_nodes(),
edges=all_edges,
node_feat={"words": features})
self.y = np.array(labels, dtype="int64")
self.num_classes = onehot_labels.shape[1]
self.train_index = np.array(idx_train, dtype="int32")
self.val_index = np.array(idx_val, dtype="int32")
self.test_index = np.array(idx_test, dtype="int32")
class CoraDataset(object):
"""Cora dataset implementation
Args:
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to contain self loop edges.
Attributes:
graph: The :code:`Graph` data object
y: Labels for each node
num_classes: Number of classes.
train_index: The index for nodes in training set.
val_index: The index for nodes in validation set.
test_index: The index for nodes in test set.
"""
def __init__(self, symmetry_edges=True, self_loop=True):
self.path = get_default_data_dir("cora")
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self._load_data()
def _load_data(self):
"""Load data"""
content = os.path.join(self.path, 'cora.content')
cite = os.path.join(self.path, 'cora.cites')
node_feature = []
paper_ids = []
y = []
y_dict = {}
with open(content, 'r') as f:
for line in f:
line = line.strip().split()
paper_id = int(line[0])
paper_class = line[-1]
if paper_class not in y_dict:
y_dict[paper_class] = len(y_dict)
feature = [int(i) for i in line[1:-1]]
feature_array = np.array(feature, dtype="float32")
# Normalize
feature_array = feature_array / (np.sum(feature_array) + 1e-15)
node_feature.append(feature_array)
y.append(y_dict[paper_class])
paper_ids.append(paper_id)
paper2vid = dict([(v, k) for (k, v) in enumerate(paper_ids)])
num_nodes = len(paper_ids)
node_feature = np.array(node_feature, dtype="float32")
all_edges = []
with open(cite, 'r') as f:
for line in f:
u, v = line.split()
u = paper2vid[int(u)]
v = paper2vid[int(v)]
all_edges.append((u, v))
if self.symmetry_edges:
all_edges.append((v, u))
if self.self_loop:
for i in range(num_nodes):
all_edges.append((i, i))
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=num_nodes,
edges=all_edges,
node_feat={"words": node_feature})
perm = np.arange(0, num_nodes)
#np.random.shuffle(perm)
self.train_index = perm[:140]
self.val_index = perm[200:500]
self.test_index = perm[500:1500]
self.y = np.array(y, dtype="int64")
self.num_classes = len(y_dict)
class BlogCatalogDataset(object):
"""BlogCatalog dataset implementation
Args:
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to contain self loop edges.
Attributes:
graph: The :code:`Graph` data object.
num_groups: Number of classes.
train_index: The index for nodes in training set.
test_index: The index for nodes in test set.
"""
def __init__(self, symmetry_edges=True, self_loop=False):
self.path = get_default_data_dir("BlogCatalog")
self.num_groups = 39
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self._load_data()
def _load_data(self):
edge_path = os.path.join(self.path, 'edges.csv')
node_path = os.path.join(self.path, 'nodes.csv')
group_edge_path = os.path.join(self.path, 'group-edges.csv')
all_edges = []
with io.open(node_path) as inf:
num_nodes = len(inf.readlines())
node_feature = np.zeros((num_nodes, self.num_groups))
with io.open(group_edge_path) as inf:
for line in inf:
node_id, group_id = line.strip('\n').split(',')
node_id, group_id = int(node_id) - 1, int(group_id) - 1
node_feature[node_id][group_id] = 1
with io.open(edge_path) as inf:
for line in inf:
u, v = line.strip('\n').split(',')
u, v = int(u) - 1, int(v) - 1
all_edges.append((u, v))
if self.symmetry_edges:
all_edges.append((v, u))
if self.self_loop:
for i in range(num_nodes):
all_edges.append((i, i))
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=num_nodes,
edges=all_edges,
node_feat={"group_id": node_feature})
perm = np.arange(0, num_nodes)
np.random.shuffle(perm)
train_num = int(num_nodes * 0.5)
self.train_index = perm[:train_num]
self.test_index = perm[train_num:]
class ArXivDataset(object):
"""ArXiv dataset implementation
Args:
np_random_seed: The random seed for numpy.
Attributes:
graph: The :code:`Graph` data object.
"""
def __init__(self, np_random_seed=123):
self.path = get_default_data_dir("arXiv")
self.np_random_seed = np_random_seed
self._load_data()
def _load_data(self):
np.random.seed(self.np_random_seed)
edge_path = os.path.join(self.path, 'ca-AstroPh.txt')
bi_edges = set()
self.neg_edges = []
self.pos_edges = []
self.node2id = dict()
def node_id(node):
if node not in self.node2id:
self.node2id[node] = len(self.node2id)
return self.node2id[node]
with io.open(edge_path) as inf:
for _ in range(4):
inf.readline()
for line in inf:
u, v = line.strip('\n').split('\t')
u, v = node_id(u), node_id(v)
if u < v:
bi_edges.add((u, v))
else:
bi_edges.add((v, u))
num_nodes = len(self.node2id)
while len(self.neg_edges) < len(bi_edges) // 2:
random_edges = np.random.choice(num_nodes, [len(bi_edges), 2])
for (u, v) in random_edges:
if u != v and (u, v) not in bi_edges and (v, u
) not in bi_edges:
self.neg_edges.append((u, v))
if len(self.neg_edges) == len(bi_edges) // 2:
break
bi_edges = list(bi_edges)
np.random.shuffle(bi_edges)
self.pos_edges = bi_edges[:len(bi_edges) // 2]
bi_edges = bi_edges[len(bi_edges) // 2:]
all_edges = []
for edge in bi_edges:
u, v = edge
all_edges.append((u, v))
all_edges.append((v, u))
self.graph = graph.Graph(num_nodes=num_nodes, edges=all_edges)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Fast implementation for graph construction and sampling.
"""
import numpy as np
cimport numpy as np
cimport cython
from libcpp.map cimport map
from libcpp.set cimport set
from libcpp.unordered_set cimport unordered_set
from libcpp.unordered_map cimport unordered_map
from libcpp.vector cimport vector
from libc.stdlib cimport rand, RAND_MAX
@cython.boundscheck(False)
@cython.wraparound(False)
def build_index(np.ndarray[np.int32_t, ndim=1] u,
np.ndarray[np.int32_t, ndim=1] v,
int num_nodes):
"""Building Edge Index
"""
cdef int i
cdef int h=len(u)
cdef int n_size = num_nodes
cdef np.ndarray[np.int32_t, ndim=1] degree = np.zeros([n_size], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] count = np.zeros([n_size], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_v = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_u = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_eid = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] indptr = np.zeros([n_size + 1], dtype=np.int32)
with nogil:
for i in xrange(h):
degree[u[i]] += 1
for i in xrange(n_size):
indptr[i + 1] = indptr[i] + degree[i]
for i in xrange(h):
_tmp_v[indptr[u[i]] + count[u[i]]] = v[i]
_tmp_eid[indptr[u[i]] + count[u[i]]] = i
_tmp_u[indptr[u[i]] + count[u[i]]] = u[i]
count[u[i]] += 1
cdef list output_eid = []
cdef list output_v = []
for i in xrange(n_size):
output_eid.append(_tmp_eid[indptr[i]:indptr[i+1]])
output_v.append(_tmp_v[indptr[i]:indptr[i+1]])
return np.array(output_v), np.array(output_eid), degree, _tmp_u, _tmp_v, _tmp_eid
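# A small worked example of the index layout, as a comment: for
# u = [0, 0, 1], v = [1, 2, 2] and num_nodes = 3, outgoing edges are grouped
# by source node, so output_v is roughly [[1, 2], [2], []], output_eid is
# [[0, 1], [2], []] and degree is [2, 1, 0]. This CSR-style layout allows
# successor and edge-id lookups per node without scanning the full edge list.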
@cython.boundscheck(False)
@cython.wraparound(False)
def map_edges(np.ndarray[np.int32_t, ndim=1] eid,
np.ndarray[np.int32_t, ndim=2] edges,
reindex):
"""Mapping edges by given dictionary
"""
cdef unordered_map[int, int] m = reindex
cdef int i = 0
cdef int h = len(eid)
cdef np.ndarray[np.int32_t, ndim=2] r_edges = np.zeros([h, 2], dtype=np.int32)
cdef int j
with nogil:
for i in xrange(h):
j = eid[i]
r_edges[i, 0] = m[edges[j, 0]]
r_edges[i, 1] = m[edges[j, 1]]
return r_edges
@cython.boundscheck(False)
@cython.wraparound(False)
def map_nodes(nodes, reindex):
"""Mapping nodes by given dictionary
"""
cdef unordered_map[int, int] m = reindex
cdef int i = 0
cdef int h = len(nodes)
cdef np.ndarray[np.int32_t, ndim=1] new_nodes = np.zeros([h], dtype=np.int32)
cdef int j
for i in xrange(h):
j = nodes[i]
new_nodes[i] = m[j]
return new_nodes
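# A small worked example for the two mapping helpers, as a comment: with
# reindex = {10: 0, 20: 1, 30: 2},
#
#     map_nodes([30, 10], reindex)
#
# returns array([2, 0], dtype=int32), while
#
#     map_edges(np.array([1], dtype=np.int32),
#               np.array([[10, 20], [20, 30]], dtype=np.int32), reindex)
#
# applies the same lookup to both endpoints of the selected edge and yields
# [[1, 2]].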
@cython.boundscheck(False)
@cython.wraparound(False)
def node2vec_sample(np.ndarray[np.int32_t, ndim=1] succ,
np.ndarray[np.int32_t, ndim=1] prev_succ, int prev_node,
float p, float q):
"""Fast implement of node2vec sampling
"""
cdef int i
cdef succ_len = len(succ)
cdef prev_succ_len = len(prev_succ)
cdef vector[float] probs
cdef float prob_sum = 0
cdef unordered_set[int] prev_succ_set
for i in xrange(prev_succ_len):
prev_succ_set.insert(prev_succ[i])
cdef float prob
for i in xrange(succ_len):
if succ[i] == prev_node:
prob = 1. / p
elif prev_succ_set.find(succ[i]) != prev_succ_set.end():
prob = 1.
else:
prob = 1. / q
probs.push_back(prob)
prob_sum += prob
cdef float rand_num = float(rand())/RAND_MAX * prob_sum
cdef int sample_succ = 0
for i in xrange(succ_len):
rand_num -= probs[i]
if rand_num <= 0:
sample_succ = succ[i]
return sample_succ
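# Note on the sampling bias, as a comment: the unnormalized transition weights
# follow the node2vec formulation -- 1/p for returning to prev_node, 1 for a
# successor that is also a neighbor of prev_node, and 1/q otherwise. For
# example, with p = 4.0 and q = 0.25, succ = [prev_node, shared_neighbor,
# distant_node] gets weights [0.25, 1.0, 4.0], biasing the walk outward
# (DFS-like); p < 1 < q would instead bias it back toward recently visited
# nodes (BFS-like).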
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Generate layers api
"""
from pgl.layers import conv
from pgl.layers.conv import *
__all__ = []
__all__ += conv.__all__
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This package implements common layers to help building
graph neural networks.
"""
import paddle.fluid as fluid
from pgl import graph_wrapper
from pgl.utils import paddle_helper
__all__ = ['gcn', 'gat']
def gcn(gw, feature, hidden_size, activation, name, norm=None):
"""Implementation of graph convolutional neural networks (GCN)
This is an implementation of the paper SEMI-SUPERVISED CLASSIFICATION
WITH GRAPH CONVOLUTIONAL NETWORKS (https://arxiv.org/pdf/1609.02907.pdf).
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
hidden_size: The hidden size for gcn.
activation: The activation for the output.
name: The name of this GCN layer, used to name its parameters.
norm: If :code:`norm` is not None, then the feature will be normalized. Norm must
be a tensor with shape (num_nodes,) and dtype float32.
Return:
A tensor with shape (num_nodes, hidden_size)
"""
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"]
size = feature.shape[-1]
if size > hidden_size:
feature = fluid.layers.fc(feature,
size=hidden_size,
bias_attr=False,
name=name)
if norm is not None:
feature = feature * norm
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
if size > hidden_size:
output = gw.recv(msg, "sum")
else:
output = gw.recv(msg, "sum")
output = fluid.layers.fc(output,
size=hidden_size,
bias_attr=False,
name=name)
if norm is not None:
output = output * norm
bias = fluid.layers.create_parameter(
shape=[hidden_size],
dtype='float32',
is_bias=True,
name=name + '_bias')
output = fluid.layers.elementwise_add(output, bias, act=activation)
return output
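# Minimal usage sketch for the gcn layer, as a comment; ``gw``, ``node_feat``
# and ``degree_norm`` are placeholders for a graph wrapper, its node-feature
# tensor and an optional (num_nodes,) float32 normalization tensor built
# elsewhere:
#
#     output = gcn(gw, node_feat, hidden_size=16, activation="relu",
#                  name="gcn_layer_1", norm=degree_norm)
#
# Stacking two such calls, feeding the output of the first as the feature of
# the second, gives the usual two-layer GCN for semi-supervised node
# classification.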
def gat(gw,
feature,
hidden_size,
activation,
name,
num_heads=8,
feat_drop=0.6,
attn_drop=0.6,
is_test=False):
"""Implementation of graph attention networks (GAT)
This is an implementation of the paper GRAPH ATTENTION NETWORKS
(https://arxiv.org/abs/1710.10903).
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
hidden_size: The hidden size for gat.
activation: The activation for the output.
name: The name of this GAT layer, used to name its parameters.
num_heads: The number of attention heads in gat.
feat_drop: Dropout rate for the input feature.
attn_drop: Dropout rate for the attention weights.
is_test: Whether in the test phase, in which dropout is disabled.
Return:
A tensor with shape (num_nodes, hidden_size * num_heads)
"""
def send_attention(src_feat, dst_feat, edge_feat):
output = src_feat["left_a"] + dst_feat["right_a"]
output = fluid.layers.leaky_relu(
output, alpha=0.2) # (num_edges, num_heads)
return {"alpha": output, "h": src_feat["h"]}
def reduce_attention(msg):
alpha = msg["alpha"] # lod-tensor (batch_size, seq_len, num_heads)
h = msg["h"]
alpha = paddle_helper.sequence_softmax(alpha)
old_h = h
h = fluid.layers.reshape(h, [-1, num_heads, hidden_size])
alpha = fluid.layers.reshape(alpha, [-1, num_heads, 1])
if attn_drop > 1e-15:
alpha = fluid.layers.dropout(
alpha,
dropout_prob=attn_drop,
is_test=is_test,
dropout_implementation="upscale_in_train")
h = h * alpha
h = fluid.layers.reshape(h, [-1, num_heads * hidden_size])
h = fluid.layers.lod_reset(h, old_h)
return fluid.layers.sequence_pool(h, "sum")
if feat_drop > 1e-15:
feature = fluid.layers.dropout(
feature,
dropout_prob=feat_drop,
is_test=is_test,
dropout_implementation='upscale_in_train')
ft = fluid.layers.fc(feature,
hidden_size * num_heads,
bias_attr=False,
name=name + '_weight')
left_a = fluid.layers.create_parameter(
shape=[num_heads, hidden_size],
dtype='float32',
name=name + '_gat_l_A')
right_a = fluid.layers.create_parameter(
shape=[num_heads, hidden_size],
dtype='float32',
name=name + '_gat_r_A')
reshape_ft = fluid.layers.reshape(ft, [-1, num_heads, hidden_size])
left_a_value = fluid.layers.reduce_sum(reshape_ft * left_a, -1)
right_a_value = fluid.layers.reduce_sum(reshape_ft * right_a, -1)
msg = gw.send(
send_attention,
nfeat_list=[("h", ft), ("left_a", left_a_value),
("right_a", right_a_value)])
output = gw.recv(msg, reduce_attention)
bias = fluid.layers.create_parameter(
shape=[hidden_size * num_heads],
dtype='float32',
is_bias=True,
name=name + '_bias')
bias.stop_gradient = True
output = fluid.layers.elementwise_add(output, bias, act=activation)
return output
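# Minimal usage sketch for the gat layer, as a comment; ``gw`` and
# ``node_feat`` are placeholders for a graph wrapper and its node-feature
# tensor built elsewhere:
#
#     output = gat(gw, node_feat, hidden_size=8, activation="elu",
#                  name="gat_layer_1", num_heads=8,
#                  feat_drop=0.6, attn_drop=0.6, is_test=False)
#
# The result has shape (num_nodes, hidden_size * num_heads), i.e. the per-head
# outputs are concatenated as in the multi-head formulation of the GAT paper.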
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
import numpy as np
from pgl.graph import Graph
class GraphTest(unittest.TestCase):
def setUp(self):
num_nodes = 5
edges = [(0, 1), (1, 2), (3, 4)]
feature = np.random.randn(5, 100)
edge_feature = np.random.randn(3, 100)
self.graph = Graph(
num_nodes=num_nodes,
edges=edges,
node_feat={"feature": feature},
edge_feat={"edge_feature": edge_feature})
def test_subgraph_consistency(self):
node_index = [0, 2, 3, 4]
eid = [2]
subgraph = self.graph.subgraph(node_index, eid)
for key, value in subgraph.node_feat.items():
diff = value - self.graph.node_feat[key][node_index]
diff = np.sqrt(np.sum(diff * diff))
self.assertLessEqual(diff, 1e-6)
for key, value in subgraph.edge_feat.items():
diff = value - self.graph.edge_feat[key][eid]
diff = np.sqrt(np.sum(diff * diff))
self.assertLessEqual(diff, 1e-6)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest
class ImportTest(unittest.TestCase):
def test_import_pgl_alone(self):
import pgl
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Logger setup
"""
import logging
log = logging.getLogger('pgl')
console = logging.StreamHandler()
log.addHandler(console)
formatter = logging.Formatter(
fmt='[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s'
)
console.setFormatter(formatter)
log.setLevel(logging.DEBUG)
log.propagate = False
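# Illustrative usage, as a comment; the import path assumes this module is
# installed as pgl.utils.logger:
#
#     from pgl.utils.logger import log
#     log.info("graph loaded with %d nodes", num_nodes)
#
# Because ``propagate`` is disabled, messages are emitted only through the
# console handler configured above.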
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The op package provides some common ops for building Paddle models.
"""
import numpy as np
import paddle.fluid as fluid
from pgl.utils import paddle_helper
def nested_lod_reset(data, reset_index):
"""Reset the lod information as the one in given reset_index.
This function apply :code:`fluid.layers.lod_reset` recursively
to all the tensor in nested data.
Args:
data: A dictionary of tensors or a single tensor.
reset_index: A variable from which the target lod information is taken.
Return:
Return a dictionary of LodTensors or a single LodTensor.
"""
if data is None:
return None
elif isinstance(data, dict):
new_data = {}
for key, value in data.items():
new_data[key] = nested_lod_reset(value, reset_index)
return new_data
else:
return fluid.layers.lod_reset(data, reset_index)
def read_rows(data, index):
"""Slice tensor with given index from dictionary of tensor or tensor
This function helps to slice data from nested dictionary structure.
Args:
data: A dictionary of tensors or a single tensor
index: A tensor of slicing index
Returns:
Return a dictionary of tensors or a single tensor.
"""
if data is None:
return None
elif isinstance(data, dict):
new_data = {}
for key, value in data.items():
new_data[key] = read_rows(value, index)
return new_data
else:
return paddle_helper.gather(data, index)
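# Illustrative example, as a comment; ``node_feat``, ``mask`` and ``index`` are
# placeholder variables:
#
#     sliced = read_rows({"h": node_feat, "extra": {"mask": mask}}, index)
#
# Both helpers recurse through nested dictionaries and fall through to the
# plain-tensor case at the leaves, so the call above returns a dictionary with
# the same keys where every leaf tensor has been gathered along its first
# dimension by ``index``; nested_lod_reset follows the same recursion but
# resets the LoD of every leaf instead.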