Unverified commit 90ff1f7f authored by Huang Zhengjie, committed by GitHub

Merge pull request #11 from PaddlePaddle/main

Merge
# data and log
/examples/GaAN/dataset/
/examples/GaAN/log/
/examples/GaAN/__pycache__/
/examples/GaAN/params/
/DoorGod
# Virtualenv
/.venv/
/venv/
......
......@@ -30,13 +30,13 @@ The newly released PGL supports heterogeneous graph learning on both walk based
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. In PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help build custom graph neural networks easily. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function ![](http://latex.codecogs.com/gif.latex?\\phi^e}) to send the message from the source node to the target node. In the second step, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v}) is responsible for aggregating ![](http://latex.codecogs.com/gif.latex?\\oplus}) the messages from different sources.
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. In PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help build custom graph neural networks easily. Users only need to write ```send``` and ```recv``` functions to implement a simple GCN. As shown in the following figure, in the first step the send function is defined on the edges of the graph, and the user can customize the send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) to send the message from the source node to the target node. In the second step, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) is responsible for aggregating ![](http://latex.codecogs.com/gif.latex?\\oplus) the messages from different sources.
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown on the left of the following figure, to accommodate general user-defined aggregation functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus}) to each batch serially. For our PGL user-defined aggregation functions, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
As shown on the left of the following figure, to accommodate general user-defined aggregation functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus) to each batch serially. For our PGL user-defined aggregation functions, we organize the messages as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), treating the messages as variable-length sequences, and we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
......
......@@ -29,11 +29,11 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd
# Highlight: Efficiency - Support for Scatter-Gather and LodTensor Message Passing
Compared with ordinary models, the biggest advantage of graph neural networks is their ability to exploit the information of the connections between nodes. However, modeling these node connections in code is quite cumbersome. PGL adopts a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) as the interface for building graph neural networks. Users only need to write simple ```send``` and ```recv``` functions to easily implement a simple GCN. As shown in the figure below, first, the send function is defined on the edges between nodes, and the user-defined send function ![](http://latex.codecogs.com/gif.latex?\\phi^e}) sends messages from the source node to the target node. Then, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v}) aggregates these messages with the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus}).
Compared with ordinary models, the biggest advantage of graph neural networks is their ability to exploit the information of the connections between nodes. However, modeling these node connections in code is quite cumbersome. PGL adopts a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) as the interface for building graph neural networks. Users only need to write simple ```send``` and ```recv``` functions to easily implement a simple GCN. As shown in the figure below, first, the send function is defined on the edges between nodes, and the user-defined send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) sends messages from the source node to the target node. Then, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) aggregates these messages with the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus).
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown on the left of the figure below, to accommodate user-defined aggregation functions, DGL uses degree bucketing to group nodes of the same degree into a block and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus}) to each block. For PGL's user-defined aggregation functions, we instead handle the messages as a PaddlePaddle [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html), treating the messages as a set of variable-length sequences, and **use the LodTensor features of PaddlePaddle to perform fast parallel message aggregation**.
As shown on the left of the figure below, to accommodate user-defined aggregation functions, DGL uses degree bucketing to group nodes of the same degree into a block and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus) to each block. For PGL's user-defined aggregation functions, we instead handle the messages as a PaddlePaddle [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html), treating the messages as a set of variable-length sequences, and **use the LodTensor features of PaddlePaddle to perform fast parallel message aggregation**.
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
......
......@@ -19,8 +19,8 @@ def build_graph():
# Each node can be represented by a d-dimensional feature vector; here, for simplicity, the feature vectors are randomly generated.
d = 16
feature = np.random.randn(num_node, d).astype("float32")
# each edge can also be represented by a feature vector
edge_feature = np.random.randn(len(edge_list), d).astype("float32")
# each edge has its own weight
edge_feature = np.random.randn(len(edge_list), 1).astype("float32")
# create a graph
g = graph.Graph(num_nodes = num_node,
......@@ -66,13 +66,13 @@ In this tutorial, we use a simple Graph Convolutional Network(GCN) developed by
In PGL, we can easily implement a GCN layer as follows:
```python
# define GCN layer function
def gcn_layer(gw, feature, hidden_size, name, activation):
def gcn_layer(gw, nfeat, efeat, hidden_size, name, activation):
# gw is a GraphWrapper; feature holds the feature vectors of the nodes
# define message function
def send_func(src_feat, dst_feat, edge_feat):
# In this tutorial, we return the feature vector of the source node as message
return src_feat['h']
return src_feat['h'] * edge_feat['e']
# define reduce function
def recv_func(feat):
......@@ -80,7 +80,7 @@ def gcn_layer(gw, feature, hidden_size, name, activation):
return fluid.layers.sequence_pool(feat, pool_type='sum')
# trigger message passing
msg = gw.send(send_func, nfeat_list=[('h', feature)])
msg = gw.send(send_func, nfeat_list=[('h', nfeat)], efeat_list=[('e', efeat)])
# the recv function receives messages and triggers the reduce function to handle them
output = gw.recv(msg, recv_func)
output = fluid.layers.fc(output,
......@@ -92,10 +92,10 @@ def gcn_layer(gw, feature, hidden_size, name, activation):
```
After defining the GCN layer, we can construct a deeper GCN model with two GCN layers.
```python
output = gcn_layer(gw, gw.node_feat['feature'],
output = gcn_layer(gw, gw.node_feat['feature'], gw.edge_feat['edge_feature'],
hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=1,
name='gcn_layer_2', activation=None)
output = gcn_layer(gw, output, gw.edge_feat['edge_feature'],
hidden_size=1, name='gcn_layer_2', activation=None)
```
## Step 3: data preprocessing
......
# GaAN: Gated Attention Networks for Learning on Large and Spatiotemporal Graphs
[GaAN](https://arxiv.org/abs/1803.07294) is a powerful neural network designed for machine learning on graphs. It introduces a gated attention mechanism. Based on PGL, we reproduce the GaAN algorithm and train the model on [ogbn-proteins](https://ogb.stanford.edu/docs/nodeprop/#ogbn-proteins).
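For reference, a single GaAN layer from this example's `conv.py` (shown later in this commit) can be applied roughly as below; this is a sketch that assumes an already-built PGL graph wrapper `gw` and its node feature tensor `feature`, using the default hyperparameter sizes:

```python
from conv import gaan  # the GaAN layer implemented in this example

# one gated multi-head attention layer over the graph held by gw
output = gaan(gw, feature,
              hidden_size_a=24, hidden_size_v=32, hidden_size_m=64,
              hidden_size_o=128, heads=8, name="gaan_layer_0")
```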
## Datasets
The ogbn-proteins dataset will be downloaded automatically into the ./dataset directory.
## Dependencies
- [paddlepaddle >= 1.6](https://github.com/paddlepaddle/paddle)
- [pgl 1.1](https://github.com/PaddlePaddle/PGL)
- [ogb 1.1.1](https://github.com/snap-stanford/ogb)
## How to run
```bash
python train.py --lr 1e-2 --rc 0 --batch_size 1024 --epochs 100
```
or
```bash
source main.sh
```
### Hyperparameters
- use_gpu: whether to use GPU
- mini_data: use a small dataset to test the code
- epochs: number of training epochs
- lr: learning rate
- rc: regularization coefficient
- log_path: the path of the log file
- batch_size: batch size
- heads: the number of attention heads
- hidden_size_a: the size of the query and key vectors
- hidden_size_v: the size of the value vectors
- hidden_size_m: the size of the projection space for computing gates
- hidden_size_o: the output size of the GaAN layer
## Performance
We train our models for 100 epochs and report the **rocauc** on the test dataset.
|dataset|mean|std|#experiments|
|-|-|-|-|
|ogbn-proteins|0.7803|0.0073|10|
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This package implements common layers to help building
graph neural networks.
"""
import paddle.fluid as fluid
from pgl import graph_wrapper
from pgl.utils import paddle_helper
__all__ = ['gcn', 'gat', 'gin', 'gaan']
def gcn(gw, feature, hidden_size, activation, name, norm=None):
"""Implementation of graph convolutional neural networks (GCN)
This is an implementation of the paper SEMI-SUPERVISED CLASSIFICATION
WITH GRAPH CONVOLUTIONAL NETWORKS (https://arxiv.org/pdf/1609.02907.pdf).
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
hidden_size: The hidden size for gcn.
activation: The activation for the output.
name: Gcn layer names.
norm: If :code:`norm` is not None, then the feature will be normalized. Norm must
be tensor with shape (num_nodes,) and dtype float32.
Return:
A tensor with shape (num_nodes, hidden_size)
"""
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"]
size = feature.shape[-1]
if size > hidden_size:
feature = fluid.layers.fc(feature,
size=hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name))
if norm is not None:
feature = feature * norm
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
if size > hidden_size:
output = gw.recv(msg, "sum")
else:
output = gw.recv(msg, "sum")
output = fluid.layers.fc(output,
size=hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name))
if norm is not None:
output = output * norm
bias = fluid.layers.create_parameter(
shape=[hidden_size],
dtype='float32',
is_bias=True,
name=name + '_bias')
output = fluid.layers.elementwise_add(output, bias, act=activation)
return output
def gat(gw,
feature,
hidden_size,
activation,
name,
num_heads=8,
feat_drop=0.6,
attn_drop=0.6,
is_test=False):
"""Implementation of graph attention networks (GAT)
This is an implementation of the paper GRAPH ATTENTION NETWORKS
(https://arxiv.org/abs/1710.10903).
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
hidden_size: The hidden size for gat.
activation: The activation for the output.
name: Gat layer names.
num_heads: The head number in gat.
feat_drop: Dropout rate for feature.
attn_drop: Dropout rate for attention.
is_test: Whether in the test phase.
Return:
A tensor with shape (num_nodes, hidden_size * num_heads)
"""
def send_attention(src_feat, dst_feat, edge_feat):
output = src_feat["left_a"] + dst_feat["right_a"]
output = fluid.layers.leaky_relu(
output, alpha=0.2) # (num_edges, num_heads)
return {"alpha": output, "h": src_feat["h"]}
def reduce_attention(msg):
alpha = msg["alpha"] # lod-tensor (batch_size, seq_len, num_heads)
h = msg["h"]
alpha = paddle_helper.sequence_softmax(alpha)
old_h = h
h = fluid.layers.reshape(h, [-1, num_heads, hidden_size])
alpha = fluid.layers.reshape(alpha, [-1, num_heads, 1])
if attn_drop > 1e-15:
alpha = fluid.layers.dropout(
alpha,
dropout_prob=attn_drop,
is_test=is_test,
dropout_implementation="upscale_in_train")
h = h * alpha
h = fluid.layers.reshape(h, [-1, num_heads * hidden_size])
h = fluid.layers.lod_reset(h, old_h)
return fluid.layers.sequence_pool(h, "sum")
if feat_drop > 1e-15:
feature = fluid.layers.dropout(
feature,
dropout_prob=feat_drop,
is_test=is_test,
dropout_implementation='upscale_in_train')
ft = fluid.layers.fc(feature,
hidden_size * num_heads,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_weight'))
left_a = fluid.layers.create_parameter(
shape=[num_heads, hidden_size],
dtype='float32',
name=name + '_gat_l_A')
right_a = fluid.layers.create_parameter(
shape=[num_heads, hidden_size],
dtype='float32',
name=name + '_gat_r_A')
reshape_ft = fluid.layers.reshape(ft, [-1, num_heads, hidden_size])
left_a_value = fluid.layers.reduce_sum(reshape_ft * left_a, -1)
right_a_value = fluid.layers.reduce_sum(reshape_ft * right_a, -1)
msg = gw.send(
send_attention,
nfeat_list=[("h", ft), ("left_a", left_a_value),
("right_a", right_a_value)])
output = gw.recv(msg, reduce_attention)
bias = fluid.layers.create_parameter(
shape=[hidden_size * num_heads],
dtype='float32',
is_bias=True,
name=name + '_bias')
bias.stop_gradient = True
output = fluid.layers.elementwise_add(output, bias, act=activation)
return output
def gin(gw,
feature,
hidden_size,
activation,
name,
init_eps=0.0,
train_eps=False):
"""Implementation of Graph Isomorphism Network (GIN) layer.
This is an implementation of the paper How Powerful are Graph Neural Networks?
(https://arxiv.org/pdf/1810.00826.pdf).
In their implementation, all MLPs have 2 layers. Batch normalization is applied
on every hidden layer.
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
name: GIN layer names.
hidden_size: The hidden size for gin.
activation: The activation for the output.
init_eps: float, optional
Initial :math:`\epsilon` value, default is 0.
train_eps: bool, optional
if True, :math:`\epsilon` will be a learnable parameter.
Return:
A tensor with shape (num_nodes, hidden_size).
"""
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"]
epsilon = fluid.layers.create_parameter(
shape=[1, 1],
dtype="float32",
attr=fluid.ParamAttr(name="%s_eps" % name),
default_initializer=fluid.initializer.ConstantInitializer(
value=init_eps))
if not train_eps:
epsilon.stop_gradient = True
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
output = gw.recv(msg, "sum") + feature * (epsilon + 1.0)
output = fluid.layers.fc(output,
size=hidden_size,
act=None,
param_attr=fluid.ParamAttr(name="%s_w_0" % name),
bias_attr=fluid.ParamAttr(name="%s_b_0" % name))
output = fluid.layers.layer_norm(
output,
begin_norm_axis=1,
param_attr=fluid.ParamAttr(
name="norm_scale_%s" % (name),
initializer=fluid.initializer.Constant(1.0)),
bias_attr=fluid.ParamAttr(
name="norm_bias_%s" % (name),
initializer=fluid.initializer.Constant(0.0)), )
if activation is not None:
output = getattr(fluid.layers, activation)(output)
output = fluid.layers.fc(output,
size=hidden_size,
act=activation,
param_attr=fluid.ParamAttr(name="%s_w_1" % name),
bias_attr=fluid.ParamAttr(name="%s_b_1" % name))
return output
def gaan(gw, feature, hidden_size_a, hidden_size_v, hidden_size_m, hidden_size_o, heads, name):
"""Implementation of GaAN"""
def send_func(src_feat, dst_feat, edge_feat):
# attention score of each edge
# E * (M * D1)
feat_query, feat_key = dst_feat['feat_query'], src_feat['feat_key']
# E * M * D1
old = feat_query
feat_query = fluid.layers.reshape(feat_query, [-1, heads, hidden_size_a])
feat_key = fluid.layers.reshape(feat_key, [-1, heads, hidden_size_a])
# E * M
alpha = fluid.layers.reduce_sum(feat_key * feat_query, dim=-1)
return {'dst_node_feat': dst_feat['node_feat'],
'src_node_feat': src_feat['node_feat'],
'feat_value': src_feat['feat_value'],
'alpha': alpha,
'feat_gate': src_feat['feat_gate']}
def recv_func(message):
dst_feat = message['dst_node_feat']
src_feat = message['src_node_feat']
x = fluid.layers.sequence_pool(dst_feat, 'average')
z = fluid.layers.sequence_pool(src_feat, 'average')
feat_gate = message['feat_gate']
g_max = fluid.layers.sequence_pool(feat_gate, 'max')
g = fluid.layers.concat([x, g_max, z], axis=1)
g = fluid.layers.fc(g, heads, bias_attr=False, act="sigmoid")
# softmax
alpha = message['alpha']
alpha = paddle_helper.sequence_softmax(alpha) # E * M
feat_value = message['feat_value'] # E * (M * D2)
old = feat_value
feat_value = fluid.layers.reshape(feat_value, [-1, heads, hidden_size_v]) # E * M * D2
feat_value = fluid.layers.elementwise_mul(feat_value, alpha, axis=0)
feat_value = fluid.layers.reshape(feat_value, [-1, heads*hidden_size_v]) # E * (M * D2)
feat_value = fluid.layers.lod_reset(feat_value, old)
feat_value = fluid.layers.sequence_pool(feat_value, 'sum') # N * (M * D2)
feat_value = fluid.layers.reshape(feat_value, [-1, heads, hidden_size_v]) # N * M * D2
output = fluid.layers.elementwise_mul(feat_value, g, axis=0)
output = fluid.layers.reshape(output, [-1, heads * hidden_size_v]) # N * (M * D2)
output = fluid.layers.concat([x, output], axis=1)
return output
# N * (D1 * M)
feat_key = fluid.layers.fc(feature, hidden_size_a * heads, bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_project_key'))
# N * (D2 * M)
feat_value = fluid.layers.fc(feature, hidden_size_v * heads, bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_project_value'))
# N * (D1 * M)
feat_query = fluid.layers.fc(feature, hidden_size_a * heads, bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_project_query'))
# N * Dm
feat_gate = fluid.layers.fc(feature, hidden_size_m, bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_project_gate'))
# send stage
message = gw.send(
send_func,
nfeat_list=[('node_feat', feature), ('feat_key', feat_key), ('feat_value', feat_value),
('feat_query', feat_query), ('feat_gate', feat_gate)],
efeat_list=None,
)
# recv stage
output = gw.recv(message, recv_func)
output = fluid.layers.fc(output, hidden_size_o, bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_project_output'))
output = fluid.layers.leaky_relu(output, alpha=0.1)
output = fluid.layers.dropout(output, dropout_prob=0.1)
return output
python3 train.py --epochs 100 --lr 1e-2 --rc 0 --batch_size 1024 --gpu_id 0 --exp_id 0
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from paddle import fluid
from pgl.utils import paddle_helper
# from pgl.layers import gaan
from conv import gaan
class GaANModel(object):
def __init__(self, num_class, num_layers, hidden_size_a=24,
hidden_size_v=32, hidden_size_m=64, hidden_size_o=128,
heads=8, act='relu', name="GaAN"):
self.num_class = num_class
self.num_layers = num_layers
self.hidden_size_a = hidden_size_a
self.hidden_size_v = hidden_size_v
self.hidden_size_m = hidden_size_m
self.hidden_size_o = hidden_size_o
self.act = act
self.name = name
self.heads = heads
def forward(self, gw):
feature = gw.node_feat['node_feat']
for i in range(self.num_layers):
feature = gaan(gw, feature, self.hidden_size_a, self.hidden_size_v,
self.hidden_size_m, self.hidden_size_o, self.heads,
self.name+'_'+str(i))
pred = fluid.layers.fc(
feature, self.num_class, act=None, name=self.name + "_pred_output")
return pred
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
from ogb.nodeproppred import NodePropPredDataset, Evaluator
import pgl
import numpy as np
import os
import time
def get_graph_data(d_name="ogbn-proteins", mini_data=False):
"""
Param:
d_name: name of dataset
mini_data: if mini_data==True, only use a small dataset (for test)
"""
# import ogb data
dataset = NodePropPredDataset(name = d_name)
num_tasks = dataset.num_tasks # obtaining the number of prediction tasks in a dataset
split_idx = dataset.get_idx_split()
train_idx, valid_idx, test_idx = split_idx["train"], split_idx["valid"], split_idx["test"]
graph, label = dataset[0]
# reshape
graph["edge_index"] = graph["edge_index"].T
# mini dataset
if mini_data:
graph['num_nodes'] = 500
mask = (graph['edge_index'][:, 0] < 500)*(graph['edge_index'][:, 1] < 500)
graph["edge_index"] = graph["edge_index"][mask]
graph["edge_feat"] = graph["edge_feat"][mask]
label = label[:500]
train_idx = np.arange(0,400)
valid_idx = np.arange(400,450)
test_idx = np.arange(450,500)
# read/compute node feature
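# Note: ogbn-proteins ships edge features but no raw node features, so the code
# below builds the feature of node i as the mean of the features of edges whose
# source is i, and caches the result as an .npy file for later runs.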
if mini_data:
node_feat_path = './dataset/ogbn_proteins_node_feat_small.npy'
else:
node_feat_path = './dataset/ogbn_proteins_node_feat.npy'
new_node_feat = None
if os.path.exists(node_feat_path):
print("Begin: read node feature".center(50, '='))
new_node_feat = np.load(node_feat_path)
print("End: read node feature".center(50, '='))
else:
print("Begin: compute node feature".center(50, '='))
start = time.perf_counter()
for i in range(graph['num_nodes']):
if i % 100 == 0:
dur = time.perf_counter() - start
print("{}/{}({}%), times: {:.2f}s".format(
i, graph['num_nodes'], i/graph['num_nodes']*100, dur
))
mask = (graph['edge_index'][:, 0] == i)
current_node_feat = np.mean(np.compress(mask, graph['edge_feat'], axis=0),
axis=0, keepdims=True)
if i == 0:
new_node_feat = [current_node_feat]
else:
new_node_feat.append(current_node_feat)
new_node_feat = np.concatenate(new_node_feat, axis=0)
print("End: compute node feature".center(50,'='))
print("Saving node feature in "+node_feat_path.center(50, '='))
np.save(node_feat_path, new_node_feat)
print("Saving finish".center(50,'='))
print(new_node_feat)
# create graph
g = pgl.graph.Graph(
num_nodes=graph["num_nodes"],
edges = graph["edge_index"],
node_feat = {'node_feat': new_node_feat},
edge_feat = None
)
print("Create graph")
print(g)
return g, label, train_idx, valid_idx, test_idx, Evaluator(d_name)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
import time
import copy
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
return nodes
def worker(batch_info, graph, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
_graph_wrapper = copy.copy(graph_wrapper)
_graph_wrapper.node_feat_tensor_dict = {}
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
edges = []
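# Multi-hop neighbor sampling: each entry of `samples` is the fan-out of one hop.
# For the current frontier we sample at most `max_deg` predecessors per node,
# record the induced edges, and grow the node set with the newly reached nodes.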
for max_deg in samples:
pred_nodes = graph.sample_predecessor(
start_nodes, max_degree=max_deg)
for dst_node, src_nodes in zip(start_nodes, pred_nodes):
for src_node in src_nodes:
edges.append((src_node, dst_node))
last_nodes = nodes
nodes = [nodes, pred_nodes]
nodes = flat_node_and_edge(nodes)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(
nodes=nodes,
edges=edges,
with_node_feat=True,
with_edge_feat=True)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = _graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = batch_train_labels
feed_dict["node_index"] = sub_node_index
feed_dict["parent_node_index"] = np.array(nodes, dtype="int64")
yield feed_dict
return work
def multiprocess_graph_reader(graph,
graph_wrapper,
samples,
node_index,
batch_size,
node_label,
with_parent_node_index=False,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd, prefix, node_feat, _with_parent_node_index):
"""parse_to_subgraph
"""
def work():
"""work
"""
for data in rd():
feed_dict = data
for key in node_feat:
feed_dict[prefix + '/node_feat/' + key] = node_feat[key][
feed_dict["parent_node_index"]]
if not _with_parent_node_index:
del feed_dict["parent_node_index"]
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)], graph,
graph_wrapper, samples))
if len(reader_pool) == 1:
r = parse_to_subgraph(reader_pool[0],
repr(graph_wrapper), graph.node_feat,
with_parent_node_index)
else:
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample,
repr(graph_wrapper), graph.node_feat,
with_parent_node_index)
return paddle.reader.buffered(r, num_workers)
return reader()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from preprocess import get_graph_data
import pgl
import argparse
import numpy as np
import time
from paddle import fluid
import reader
from train_tool import train_epoch, valid_epoch
from model import GaANModel
if __name__ == "__main__":
parser = argparse.ArgumentParser(description="ogb Training")
parser.add_argument("--d_name", type=str, choices=["ogbn-proteins"], default="ogbn-proteins",
help="the name of dataset in ogb")
parser.add_argument("--model", type=str, choices=["GaAN"], default="GaAN",
help="the name of model")
parser.add_argument("--mini_data", type=str, choices=["True", "False"], default="False",
help="use a small dataset to test the code")
parser.add_argument("--use_gpu", type=bool, choices=[True, False], default=True,
help="use gpu")
parser.add_argument("--gpu_id", type=int, default=0,
help="the id of gpu")
parser.add_argument("--exp_id", type=int, default=0,
help="the id of experiment")
parser.add_argument("--epochs", type=int, default=100,
help="the number of training epochs")
parser.add_argument("--lr", type=float, default=1e-2,
help="learning rate of Adam")
parser.add_argument("--rc", type=float, default=0,
help="regularization coefficient")
parser.add_argument("--log_path", type=str, default="./log",
help="the path of log")
parser.add_argument("--batch_size", type=int, default=1024,
help="the number of batch size")
parser.add_argument("--heads", type=int, default=8,
help="the number of heads of attention")
parser.add_argument("--hidden_size_a", type=int, default=24,
help="the hidden size of query and key vectors")
parser.add_argument("--hidden_size_v", type=int, default=32,
help="the hidden size of value vectors")
parser.add_argument("--hidden_size_m", type=int, default=64,
help="the hidden size of projection for computing gates")
parser.add_argument("--hidden_size_o", type=int ,default=128,
help="the hidden size of each layer in GaAN")
args = parser.parse_args()
print("Parameters Setting".center(50, "="))
print("lr = {}, rc = {}, epochs = {}, batch_size = {}".format(args.lr, args.rc, args.epochs,
args.batch_size))
print("Experiment ID: {}".format(args.exp_id).center(50, "="))
print("training in GPU: {}".format(args.gpu_id).center(50, "="))
d_name = args.d_name
# get data
g, label, train_idx, valid_idx, test_idx, evaluator = get_graph_data(d_name=d_name,
mini_data=eval(args.mini_data))
if args.model == "GaAN":
graph_model = GaANModel(112, 3, args.hidden_size_a, args.hidden_size_v, args.hidden_size_m,
args.hidden_size_o, args.heads)
# training
samples = [25, 10] # 2-hop sample size
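# fan-out of the neighbor sampler used by the reader: at most 25 sampled
# predecessors in the first hop and 10 in the second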
batch_size = args.batch_size
sample_workers = 1
place = fluid.CUDAPlace(args.gpu_id) if args.use_gpu else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name='graph',
place = place,
node_feat=g.node_feat_info(),
edge_feat=g.edge_feat_info()
)
node_index = fluid.layers.data('node_index', shape=[None, 1], dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data('node_label', shape=[None, 112], dtype="float32",
append_batch_size=False)
parent_node_index = fluid.layers.data('parent_node_index', shape=[None, 1], dtype="int64",
append_batch_size=False)
output = graph_model.forward(gw)
output = fluid.layers.gather(output, node_index)
score = fluid.layers.sigmoid(output)
loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=output, label=node_label)
loss = fluid.layers.mean(loss)
val_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
lr = args.lr
adam = fluid.optimizer.Adam(
learning_rate=lr,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=args.rc))
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
g,
gw,
samples=samples,
num_workers=sample_workers,
batch_size=batch_size,
with_parent_node_index=True,
node_index=train_idx,
node_label=np.array(label[train_idx], dtype='float32'))
val_iter = reader.multiprocess_graph_reader(
g,
gw,
samples=samples,
num_workers=sample_workers,
batch_size=batch_size,
with_parent_node_index=True,
node_index=valid_idx,
node_label=np.array(label[valid_idx], dtype='float32'))
test_iter = reader.multiprocess_graph_reader(
g,
gw,
samples=samples,
num_workers=sample_workers,
batch_size=batch_size,
with_parent_node_index=True,
node_index=test_idx,
node_label=np.array(label[test_idx], dtype='float32'))
start = time.time()
print("Training Begin".center(50, "="))
best_valid = -1.0
for epoch in range(args.epochs):
start_e = time.time()
train_loss, train_rocauc = train_epoch(
train_iter, program=train_program, exe=exe, loss=loss, score=score,
evaluator=evaluator, epoch=epoch
)
valid_loss, valid_rocauc = valid_epoch(
val_iter, program=val_program, exe=exe, loss=loss, score=score,
evaluator=evaluator, epoch=epoch)
end_e = time.time()
print("Epoch {}: train_loss={:.4},val_loss={:.4}, train_rocauc={:.4}, val_rocauc={:.4}, s/epoch={:.3}".format(
epoch, train_loss, valid_loss, train_rocauc, valid_rocauc, end_e-start_e
))
if valid_rocauc > best_valid:
print("Update: new {}, old {}".format(valid_rocauc, best_valid))
best_valid = valid_rocauc
fluid.io.save_params(executor=exe, dirname='./params/'+str(args.exp_id), main_program=val_program)
print("Test Stage".center(50, "="))
fluid.io.load_params(executor=exe, dirname='./params/'+str(args.exp_id), main_program=val_program)
test_loss, test_rocauc = valid_epoch(
test_iter, program=val_program, exe=exe, loss=loss, score=score,
evaluator=evaluator, epoch=epoch)
end = time.time()
print("test_loss={:.4},test_rocauc={:.4}, Total Time={:.3}".format(
test_loss, test_rocauc, end-start
))
print("End".center(50, "="))
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import time
from pgl.utils.logger import log
def train_epoch(batch_iter, exe, program, loss, score, evaluator, epoch, log_per_step=1):
batch = 0
total_loss = 0.0
total_sample = 0
result = 0
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, y_pred = exe.run(program, fetch_list=[loss, score], feed=batch_feed_dict)
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_sample += num_samples
input_dict = {
"y_true": batch_feed_dict["node_label"],
"y_pred": y_pred
}
result += evaluator.eval(input_dict)["rocauc"]
return total_loss.item()/total_sample, result/batch
def valid_epoch(batch_iter, exe, program, loss, score, evaluator, epoch, log_per_step=1):
batch = 0
total_sample = 0
result = 0
total_loss = 0.0
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, y_pred = exe.run(program, fetch_list=[loss, score], feed=batch_feed_dict)
input_dict = {
"y_true": batch_feed_dict["node_label"],
"y_pred": y_pred
}
result += evaluator.eval(input_dict)["rocauc"]
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_sample += num_samples
return total_loss.item()/total_sample, result/batch
# Self-Attention Graph Pooling
SAGPool is a graph pooling method based on self-attention. The attention scores are computed with graph convolution, so the pooling takes both node features and graph topology into account. Based on PGL, we implement the SAGPool algorithm and train the model on five datasets.
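The core selection step can be sketched in a few lines of NumPy (an illustration of the idea, not the PGL implementation that follows later in this commit): a GCN produces one attention score per node, the top `ratio * N` nodes are kept, and the kept features are rescaled by `tanh` of their scores.

```python
import numpy as np

def sag_pool_toy(node_feat, score, ratio=0.5):
    """Toy SAGPool selection for a single graph (illustrative only)."""
    num_keep = max(1, int(np.ceil(ratio * len(score))))
    keep = np.argsort(-score)[:num_keep]  # indices of the highest-scoring nodes
    return node_feat[keep] * np.tanh(score[keep])[:, None], keep
```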
## Datasets
There are five datasets: D&D, PROTEINS, NCI1, NCI109 and FRANKENSTEIN. You can download the datasets from [here](https://bj.bcebos.com/paddle-pgl/SAGPool/data.zip) and unzip them directly; the pkl-format datasets should then be in the ./data directory.
## Dependencies
- [paddlepaddle >= 1.8](https://github.com/PaddlePaddle/paddle)
- [pgl 1.1](https://github.com/PaddlePaddle/PGL)
## How to run
```
python main.py --dataset_name DD --learning_rate 0.005 --weight_decay 0.00001
python main.py --dataset_name PROTEINS --learning_rate 0.001 --hidden_size 32 --weight_decay 0.00001
python main.py --dataset_name NCI1 --learning_rate 0.001 --weight_decay 0.00001
python main.py --dataset_name NCI109 --learning_rate 0.0005 --hidden_size 64 --weight_decay 0.0001 --patience 200
python main.py --dataset_name FRANKENSTEIN --learning_rate 0.001 --weight_decay 0.0001
```
## Hyperparameters
- seed: random seed
- batch\_size: batch size
- learning\_rate: learning rate of the optimizer
- weight\_decay: the weight decay for L2 regularization
- hidden\_size: the hidden size of the GCN layers
- pooling\_ratio: the pooling ratio of SAGPool
- dropout\_ratio: dropout ratio
- dataset\_name: the name of the dataset, one of DD, PROTEINS, NCI1, NCI109, FRANKENSTEIN
- epochs: maximum number of epochs
- patience: patience for early stopping
- use\_cuda: whether to use CUDA
- save\_model: the file name for saving the best model
## Performance
We evaluate the implemented method for 20 random seeds using 10-fold cross validation, following the same training procedures as in the paper.
| dataset | mean accuracy | standard deviation | mean accuracy(paper) | standard deviation(paper) |
| ------------ | ------------- | ------------------ | -------------------- | ------------------------- |
| DD | 74.4181 | 1.0244 | 76.19 | 0.94 |
| PROTEINS | 72.7858 | 0.6617 | 70.04 | 1.47 |
| NCI1 | 75.781 | 1.2125 | 74.18 | 1.2 |
| NCI109 | 74.3156 | 1.3 | 74.06 | 0.78 |
| FRANKENSTEIN | 60.7826 | 0.629 | 62.57 | 0.6 |
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=777,
help='seed')
parser.add_argument('--batch_size', type=int, default=128,
help='batch size')
parser.add_argument('--learning_rate', type=float, default=0.0005,
help='learning rate')
parser.add_argument('--weight_decay', type=float, default=0.0001,
help='weight decay')
parser.add_argument('--hidden_size', type=int, default=128,
help='gcn hidden size')
parser.add_argument('--pooling_ratio', type=float, default=0.5,
help='pooling ratio of SAGPool')
parser.add_argument('--dropout_ratio', type=float, default=0.5,
help='dropout ratio')
parser.add_argument('--dataset_name', type=str, default='DD',
help='DD/PROTEINS/NCI1/NCI109/FRANKENSTEIN')
parser.add_argument('--epochs', type=int, default=100000,
help='maximum number of epochs')
parser.add_argument('--patience', type=int, default=50,
help='patience for early stopping')
parser.add_argument('--use_cuda', type=bool, default=True,
help='use cuda or cpu')
parser.add_argument('--save_model', type=str,
help='save model name')
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os
import random
import pgl
from pgl.utils.logger import log
from pgl.graph import Graph, MultiGraph
import numpy as np
import pickle
class BaseDataset(object):
def __init__(self):
pass
def __getitem__(self, idx):
raise NotImplementedError
def __len__(self):
raise NotImplementedError
class Subset(BaseDataset):
"""Subset of a dataset at specified indices.
Args:
dataset (Dataset): The whole Dataset
indices (sequence): Indices in the whole set selected for subset
"""
def __init__(self, dataset, indices):
self.dataset = dataset
self.indices = indices
def __getitem__(self, idx):
return self.dataset[self.indices[idx]]
def __len__(self):
return len(self.indices)
class Dataset(BaseDataset):
def __init__(self, args):
self.args = args
with open('data/%s.pkl' % args.dataset_name, 'rb') as f:
graphs_info_list = pickle.load(f)
self.pgl_graph_list = []
self.graph_label_list = []
for i in range(len(graphs_info_list) - 1):
graph = graphs_info_list[i]
edges_l, edges_r = graph["edge_src"], graph["edge_dst"]
# add self-loops
if self.args.dataset_name != "FRANKENSTEIN":
num_nodes = graph["num_nodes"]
x = np.arange(0, num_nodes)
edges_l = np.append(edges_l, x)
edges_r = np.append(edges_r, x)
edges = list(zip(edges_l, edges_r))
g = pgl.graph.Graph(num_nodes=graph["num_nodes"], edges=edges)
g.node_feat["feat"] = graph["node_feat"]
self.pgl_graph_list.append(g)
self.graph_label_list.append(graph["label"])
self.num_classes = graphs_info_list[-1]["num_classes"]
self.num_features = graphs_info_list[-1]["num_features"]
def __getitem__(self, idx):
return self.pgl_graph_list[idx], self.graph_label_list[idx]
def shuffle(self):
"""shuffle the dataset.
"""
cc = list(zip(self.pgl_graph_list, self.graph_label_list))
random.seed(self.args.seed)
random.shuffle(cc)
a, b = zip(*cc)
self.pgl_graph_list[:], self.graph_label_list[:] = a, b
def __len__(self):
return len(self.pgl_graph_list)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle.fluid as fluid
import paddle.fluid.layers as L
def norm_gcn(gw, feature, hidden_size, activation, name, norm=None):
"""Implementation of graph convolutional neural networks(GCN), using different
normalization method.
Args:
gw: Graph wrapper object.
feature: A tensor with shape (num_nodes, feature_size).
hidden_size: The hidden size for norm gcn.
activation: The activation for the output.
name: Norm gcn layer names.
norm: If norm is not None, then the feature will be normalized. Norm must
be tensor with shape (num_nodes,) and dtype float32.
Return:
A tensor with shape (num_nodes, hidden_size)
"""
size = feature.shape[-1]
feature = L.fc(feature,
size=hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name))
if norm is not None:
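# `norm` holds one value per node (1/sqrt(indegree), as built by the dataloader);
# gathering it for both endpoints of every edge gives the per-edge symmetric
# normalization 1/sqrt(d_src) * 1/sqrt(d_dst) applied to the messages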
src, dst = gw.edges
norm_src = L.gather(norm, src, overwrite=False)
norm_dst = L.gather(norm, dst, overwrite=False)
norm = norm_src * norm_dst
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"] * norm
else:
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"]
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
output = gw.recv(msg, "sum")
bias = L.create_parameter(
shape=[hidden_size],
dtype='float32',
is_bias=True,
name=name + '_bias')
output = L.elementwise_add(output, bias, act=activation)
return output
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import collections
import paddle
import pgl
from pgl.utils.logger import log
from pgl.graph import Graph, MultiGraph
def batch_iter(data, batch_size):
"""node_batch_iter
"""
size = len(data)
perm = np.arange(size)
np.random.shuffle(perm)
start = 0
while start < size:
index = perm[start:start + batch_size]
start += batch_size
yield data[index]
def scan_batch_iter(data, batch_size):
"""scan_batch_iter
"""
batch = []
for example in data.scan():
batch.append(example)
if len(batch) == batch_size:
yield batch
batch = []
if len(batch) > 0:
yield batch
def label_to_onehot(labels):
"""Return one-hot representations of labels
"""
onehot_labels = []
for label in labels:
if label == 0:
onehot_labels.append([1, 0])
else:
onehot_labels.append([0, 1])
onehot_labels = np.array(onehot_labels)
return onehot_labels
class GraphDataloader(object):
"""Graph Dataloader
"""
def __init__(self,
dataset,
graph_wrapper,
batch_size,
seed=0,
buf_size=1000,
shuffle=True):
self.shuffle = shuffle
self.seed = seed
self.batch_size = batch_size
self.dataset = dataset
self.buf_size = buf_size
self.graph_wrapper = graph_wrapper
def batch_fn(self, batch_examples):
""" batch_fun batch producer """
graphs = [b[0] for b in batch_examples]
labels = [b[1] for b in batch_examples]
join_graph = MultiGraph(graphs)
# normalize
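# symmetric GCN normalization: norm_i = 1/sqrt(indegree_i); nodes with indegree 0 keep norm 0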
indegree = join_graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
join_graph.node_feat["norm"] = np.expand_dims(norm, -1)
feed_dict = self.graph_wrapper.to_feed(join_graph)
labels = np.array(labels)
feed_dict["labels_1dim"] = labels
labels = label_to_onehot(labels)
feed_dict["labels"] = labels
graph_lod = join_graph.graph_lod
graph_id = []
for i in range(1, len(graph_lod)):
graph_node_num = graph_lod[i] - graph_lod[i - 1]
graph_id += [i - 1] * graph_node_num
graph_id = np.array(graph_id, dtype="int32")
feed_dict["graph_id"] = graph_id
return feed_dict
def batch_iter(self):
""" batch_iter """
if self.shuffle:
for batch in batch_iter(self, self.batch_size):
yield batch
else:
for batch in scan_batch_iter(self, self.batch_size):
yield batch
def __len__(self):
"""__len__"""
return len(self.dataset)
def __getitem__(self, idx):
"""__getitem__"""
if isinstance(idx, collections.Iterable):
return [self.dataset[bidx] for bidx in idx]
else:
return self.dataset[idx]
def __iter__(self):
"""__iter__"""
def func_run():
for batch_examples in self.batch_iter():
batch_dict = self.batch_fn(batch_examples)
yield batch_dict
r = paddle.reader.buffered(func_run, self.buf_size)
for batch in r():
yield batch
def scan(self):
"""scan"""
for example in self.dataset:
yield example
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as L
import pgl
from pgl.graph_wrapper import GraphWrapper
from pgl.utils.logger import log
from conv import norm_gcn
from pgl.layers.conv import gcn
def topk_pool(gw, score, graph_id, ratio):
"""Implementation of topk pooling, where k means pooling ratio.
Args:
gw: Graph wrapper object.
score: The attention score of all nodes, which is used to select
important nodes.
graph_id: The graphs that the nodes belong to.
ratio: The pooling ratio of nodes we want to select.
Return:
perm: The index of nodes we choose.
ratio_length: The selected node numbers of each graph.
"""
graph_lod = gw.graph_lod
graph_nodes = gw.num_nodes
num_graph = gw.num_graph
num_nodes = L.ones(shape=[graph_nodes], dtype="float32")
num_nodes = L.lod_reset(num_nodes, graph_lod)
num_nodes_per_graph = L.sequence_pool(num_nodes, pool_type='sum')
max_num_nodes = L.reduce_max(num_nodes_per_graph, dim=0)
max_num_nodes = L.cast(max_num_nodes, dtype="int32")
index = L.arange(0, gw.num_nodes, dtype="int64")
offset = L.gather(graph_lod, graph_id, overwrite=False)
index = (index - offset) + (graph_id * max_num_nodes)
index.stop_gradient = True
# padding
dense_score = L.fill_constant(shape=[num_graph * max_num_nodes],
dtype="float32", value=-999999)
index = L.reshape(index, shape=[-1])
dense_score = L.scatter(dense_score, index, updates=score)
num_graph = L.cast(num_graph, dtype="int32")
dense_score = L.reshape(dense_score,
shape=[num_graph, max_num_nodes])
# record the sorted index
_, sort_index = L.argsort(dense_score, axis=-1, descending=True)
# recover the index range
graph_lod = graph_lod[:-1]
graph_lod = L.reshape(graph_lod, shape=[-1, 1])
graph_lod = L.cast(graph_lod, dtype="int64")
sort_index = L.elementwise_add(sort_index, graph_lod, axis=-1)
sort_index = L.reshape(sort_index, shape=[-1, 1])
# use sequence_slice to choose selected node index
pad_lod = L.arange(0, (num_graph + 1) * max_num_nodes, step=max_num_nodes, dtype="int32")
sort_index = L.lod_reset(sort_index, pad_lod)
ratio_length = L.ceil(num_nodes_per_graph * ratio)
ratio_length = L.cast(ratio_length, dtype="int64")
ratio_length = L.reshape(ratio_length, shape=[-1, 1])
offset = L.zeros(shape=[num_graph, 1], dtype="int64")
choose_index = L.sequence_slice(input=sort_index, offset=offset, length=ratio_length)
perm = L.reshape(choose_index, shape=[-1])
return perm, ratio_length
def sag_pool(gw, feature, ratio, graph_id, dataset, name, activation=L.tanh):
"""Implementation of self-attention graph pooling (SAGPool)
This is an implementation of the paper SELF-ATTENTION GRAPH POOLING
(https://arxiv.org/pdf/1904.08082.pdf)
Args:
gw: Graph wrapper object.
feature: A tensor with shape (num_nodes, feature_size).
ratio: The pooling ratio of nodes we want to select.
graph_id: The graphs that the nodes belong to.
dataset: To differentiate FRANKENSTEIN dataset and other datasets.
name: The name of SAGPool layer.
activation: The activation function.
Return:
new_feature: A tensor with shape (num_nodes, feature_size), and the unselected
nodes' feature is masked by zero.
ratio_length: The selected node numbers of each graph.
"""
if dataset == "FRANKENSTEIN":
gcn_ = gcn
else:
gcn_ = norm_gcn
score = gcn_(gw=gw,
feature=feature,
hidden_size=1,
activation=None,
norm=gw.node_feat["norm"],
name=name)
score = L.squeeze(score, axes=[])
perm, ratio_length = topk_pool(gw, score, graph_id, ratio)
mask = L.zeros_like(score)
mask = L.cast(mask, dtype="float32")
updates = L.ones_like(perm)
updates = L.cast(updates, dtype="float32")
mask = L.scatter(mask, perm, updates)
new_feature = L.elementwise_mul(feature, mask, axis=0)
temp_score = activation(score)
new_feature = L.elementwise_mul(new_feature, temp_score, axis=0)
return new_feature, ratio_length
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import os
import argparse
import pgl
from pgl.utils.logger import log
import paddle
import re
import time
import random
import numpy as np
import math
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as L
import pgl
from pgl.utils.logger import log
from model import GlobalModel
from base_dataset import Subset, Dataset
from dataloader import GraphDataloader
from args import parser
import warnings
from sklearn.model_selection import KFold
warnings.filterwarnings("ignore")
def main(args, train_dataset, val_dataset, test_dataset):
"""main function for running one testing results.
"""
log.info("Train Examples: %s" % len(train_dataset))
log.info("Val Examples: %s" % len(val_dataset))
log.info("Test Examples: %s" % len(test_dataset))
train_program = fluid.Program()
train_program.random_seed = args.seed
startup_program = fluid.Program()
startup_program.random_seed = args.seed
if args.use_cuda:
place = fluid.CUDAPlace(0)
else:
place = fluid.CPUPlace()
exe = fluid.Executor(place)
log.info("building model")
with fluid.program_guard(train_program, startup_program):
with fluid.unique_name.guard():
graph_model = GlobalModel(args, dataset)
train_loader = GraphDataloader(train_dataset,
graph_model.graph_wrapper,
batch_size=args.batch_size)
optimizer = fluid.optimizer.Adam(learning_rate=args.learning_rate,
regularization=fluid.regularizer.L2DecayRegularizer(args.weight_decay))
optimizer.minimize(graph_model.loss)
exe.run(startup_program)
test_program = fluid.Program()
test_program = train_program.clone(for_test=True)
val_loader = GraphDataloader(val_dataset,
graph_model.graph_wrapper,
batch_size=args.batch_size,
shuffle=False)
test_loader = GraphDataloader(test_dataset,
graph_model.graph_wrapper,
batch_size=args.batch_size,
shuffle=False)
min_loss = 1e10
global_step = 0
for epoch in range(args.epochs):
for feed_dict in train_loader:
loss, pred = exe.run(train_program,
feed=feed_dict,
fetch_list=[graph_model.loss, graph_model.pred])
log.info("Epoch: %d, global_step: %d, Training loss: %f" \
% (epoch, global_step, loss))
global_step += 1
# validation
valid_loss = 0.
correct = 0.
for feed_dict in val_loader:
valid_loss_, correct_ = exe.run(test_program,
feed=feed_dict,
fetch_list=[graph_model.loss, graph_model.correct])
valid_loss += valid_loss_
correct += correct_
if epoch % 50 == 0:
log.info("Epoch:%d, Validation loss: %f, Validation acc: %f" \
% (epoch, valid_loss, correct / len(val_loader)))
if valid_loss < min_loss:
min_loss = valid_loss
patience = 0
path = "./save/%s" % args.dataset_name
if not os.path.exists(path):
os.makedirs(path)
fluid.save(train_program, "%s/%s" \
% (path, args.save_model))
log.info("Model saved at epoch %d" % epoch)
else:
patience += 1
if patience > args.patience:
break
correct = 0.
fluid.load(test_program, "./save/%s/%s" \
% (args.dataset_name, args.save_model), exe)
for feed_dict in test_loader:
correct_ = exe.run(test_program,
feed=feed_dict,
fetch_list=[graph_model.correct])
correct += correct_[0]
log.info("Test acc: %f" % (correct / len(test_loader)))
return correct / len(test_loader)
def split_10_cv(dataset, args):
"""10 folds cross validation
"""
dataset.shuffle()
X = np.array([0] * len(dataset))
y = X
kf = KFold(n_splits=10, shuffle=False)
i = 1
test_acc = []
for train_index, test_index in kf.split(X, y):
train_val_dataset = Subset(dataset, train_index)
test_dataset = Subset(dataset, test_index)
train_val_index_range = list(range(0, len(train_val_dataset)))
num_val = int(len(train_val_dataset) / 9)
val_dataset = Subset(train_val_dataset, train_val_index_range[:num_val])
train_dataset = Subset(train_val_dataset, train_val_index_range[num_val:])
log.info("######%d fold of 10-fold cross validation######" % i)
i += 1
test_acc_ = main(args, train_dataset, val_dataset, test_dataset)
test_acc.append(test_acc_)
mean_acc = sum(test_acc) / len(test_acc)
return mean_acc, test_acc
def random_seed_20(args, dataset):
"""run for 20 random seeds
"""
alist = random.sample(range(1,1000),20)
test_acc_fold = []
for seed in alist:
log.info('############ Seed %d ############' % seed)
args.seed = seed
test_acc_fold_, _ = split_10_cv(dataset, args)
log.info('Mean test acc at seed %d: %f' % (seed, test_acc_fold_))
test_acc_fold.append(test_acc_fold_)
mean_acc = sum(test_acc_fold) / len(test_acc_fold)
temp = [(acc - mean_acc) * (acc - mean_acc) for acc in test_acc_fold]
std_dev = math.sqrt(sum(temp) / len(test_acc_fold))
log.info('Final mean test acc using 20 random seeds (mean for 10-fold): %f' % (mean_acc))
log.info('Final standard deviation using 20 random seeds (mean for 10-fold): %f' % (std_dev))
if __name__ == "__main__":
args = parser.parse_args()
log.info('loading data...')
dataset = Dataset(args)
log.info("preprocess finish.")
args.num_classes = dataset.num_classes
args.num_features = dataset.num_features
random_seed_20(args, dataset)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from random import random
import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as L
import pgl
from pgl.graph import Graph, MultiGraph
from pgl.graph_wrapper import GraphWrapper
from pgl.utils.logger import log
from pgl.layers.conv import gcn
from layers import sag_pool
from conv import norm_gcn
class GlobalModel(object):
"""Implementation of global pooling architecture with SAGPool.
"""
def __init__(self, args, dataset):
self.args = args
self.dataset = dataset
self.hidden_size = args.hidden_size
self.num_classes = args.num_classes
self.num_features = args.num_features
self.pooling_ratio = args.pooling_ratio
self.dropout_ratio = args.dropout_ratio
self.batch_size = args.batch_size
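# Build a dummy batch from two sample graphs, only to infer the node feature info for the GraphWrapper.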
graph_data = []
g, label = self.dataset[0]
graph_data.append(g)
g, label = self.dataset[1]
graph_data.append(g)
batch_graph = MultiGraph(graph_data)
indegree = batch_graph.indegree()
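# GCN normalization: norm = indegree^(-1/2), with 0 for isolated nodes, stored as a node feature.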
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
batch_graph.node_feat["norm"] = np.expand_dims(norm, -1)
graph_data = batch_graph
self.graph_wrapper = GraphWrapper(
name="graph",
node_feat=graph_data.node_feat_info()
)
self.labels = L.data(
"labels",
shape=[None, self.args.num_classes],
dtype="int32",
append_batch_size=False)
self.labels_1dim = L.data(
"labels_1dim",
shape=[None],
dtype="int32",
append_batch_size=False)
self.graph_id = L.data(
"graph_id",
shape=[None],
dtype="int32",
append_batch_size=False)
if self.args.dataset_name == "FRANKENSTEIN":
self.gcn = gcn
else:
self.gcn = norm_gcn
self.build_model()
def build_model(self):
node_features = self.graph_wrapper.node_feat["feat"]
output = self.gcn(gw=self.graph_wrapper,
feature=node_features,
hidden_size=self.hidden_size,
activation="relu",
norm=self.graph_wrapper.node_feat["norm"],
name="gcn_layer_1")
output1 = output
output = self.gcn(gw=self.graph_wrapper,
feature=output,
hidden_size=self.hidden_size,
activation="relu",
norm=self.graph_wrapper.node_feat["norm"],
name="gcn_layer_2")
output2 = output
output = self.gcn(gw=self.graph_wrapper,
feature=output,
hidden_size=self.hidden_size,
activation="relu",
norm=self.graph_wrapper.node_feat["norm"],
name="gcn_layer_3")
output = L.concat(input=[output1, output2, output], axis=-1)
output, ratio_length = sag_pool(gw=self.graph_wrapper,
feature=output,
ratio=self.pooling_ratio,
graph_id=self.graph_id,
dataset=self.args.dataset_name,
name="sag_pool_1")
output = L.lod_reset(output, self.graph_wrapper.graph_lod)
cat1 = L.sequence_pool(output, "sum")
ratio_length = L.cast(ratio_length, dtype="float32")
cat1 = L.elementwise_div(cat1, ratio_length, axis=-1)
cat2 = L.sequence_pool(output, "max")
output = L.concat(input=[cat2, cat1], axis=-1)
output = L.fc(output, size=self.hidden_size, act="relu")
output = L.dropout(output, dropout_prob=self.dropout_ratio)
output = L.fc(output, size=self.hidden_size // 2, act="relu")
output = L.fc(output, size=self.num_classes, act=None,
param_attr=fluid.ParamAttr(name="final_fc"))
self.labels = L.cast(self.labels, dtype="float32")
loss = L.sigmoid_cross_entropy_with_logits(x=output, label=self.labels)
self.loss = L.mean(loss)
pred = L.sigmoid(output)
self.pred = L.argmax(x=pred, axis=-1)
correct = L.equal(self.pred, self.labels_1dim)
correct = L.cast(correct, dtype="int32")
self.correct = L.reduce_sum(correct)
# DeeperGCN: All You Need to Train Deeper GCNs
See the paper for more information: https://arxiv.org/pdf/2006.07739.pdf
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~77% |
### How to run
For example, use GPU to train DeeperGCN on the Cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if `use_cuda` is set (an additional CPU example is given below).
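The same script runs on CPU when `--use_cuda` is omitted. For instance, to train on the PubMed dataset (assuming the same command-line interface), one could run:
```
python train.py --dataset pubmed
```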
import pgl
import paddle.fluid as fluid
def DeeperGCN(gw, feature, num_layers,
hidden_size, num_tasks, name, dropout_prob):
"""Implementation of DeeperGCN, see the paper
"DeeperGCN: All You Need to Train Deeper GCNs" in
https://arxiv.org/pdf/2006.07739.pdf
Args:
gw: Graph wrapper object
feature: A tensor with shape (num_nodes, feature_size)
num_layers: num of layers in DeeperGCN
hidden_size: hidden_size in DeeperGCN
num_tasks: dimension of the final prediction (e.g., number of classes)
name: deeper gcn layer names
dropout_prob: dropout prob in DeeperGCN
Return:
A tensor with shape (num_nodes, num_tasks)
"""
beta = "dynamic"
feature = fluid.layers.fc(feature,
hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_weight'))
output = pgl.layers.gen_conv(gw, feature, name=name+"_gen_conv_0", beta=beta)
for layer in range(num_layers):
# LN/BN->ReLU->GraphConv->Res
old_output = output
# 1. Layer Norm
output = fluid.layers.layer_norm(
output,
begin_norm_axis=1,
param_attr=fluid.ParamAttr(
name="norm_scale_%s_%d" % (name, layer),
initializer=fluid.initializer.Constant(1.0)),
bias_attr=fluid.ParamAttr(
name="norm_bias_%s_%d" % (name, layer),
initializer=fluid.initializer.Constant(0.0)))
# 2. ReLU
output = fluid.layers.relu(output)
# 3. Dropout
output = fluid.layers.dropout(output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train")
# 4. gen_conv
output = pgl.layers.gen_conv(gw, output,
name=name+"_gen_conv_%d"%layer, beta=beta)
# 5. Residual connection
output = output + old_output
# final layer: LN + ReLU + dropout
output = fluid.layers.layer_norm(
output,
begin_norm_axis=1,
param_attr=fluid.ParamAttr(
name="norm_scale_%s_%d" % (name, num_layers),
initializer=fluid.initializer.Constant(1.0)),
bias_attr=fluid.ParamAttr(
name="norm_bias_%s_%d" % (name, num_layers),
initializer=fluid.initializer.Constant(0.0)))
output = fluid.layers.relu(output)
output = fluid.layers.dropout(output,
dropout_prob=dropout_prob,
dropout_implementation="upscale_in_train")
# final prediction
output = fluid.layers.fc(output,
num_tasks,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_final_weight'))
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#-*- coding: utf-8 -*-
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
from pgl.utils.log_writer import LogWriter # vdl
from model import DeeperGCN
def load(name):
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def main(args):
# vdl
writer = LogWriter("checkpoints/train_history")
dataset = load(args.dataset)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 64
num_layers = 7
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
node_feat=dataset.graph.node_feat_info())
output = DeeperGCN(gw,
gw.node_feat["words"],
num_layers,
hidden_size,
dataset.num_classes,
"deepercnn",
0.1)
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005),
learning_rate=0.005)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
# get beta param
beta_param_list = []
for param in fluid.io.get_program_parameter(train_program):
if param.name.endswith("_beta"):
beta_param_list.append(param)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
for param in beta_param_list:
beta = np.array(fluid.global_scope().find_var(param.name).get_tensor())
writer.add_scalar("beta/"+param.name, beta, epoch)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='DeeperGCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
......@@ -6,54 +6,32 @@ information (e.g., text attributes) to efficiently generate node embeddings for
For the purpose of high scalability, we use Redis as a distributed graph storage solution and train GraphSAGE against the Redis server.
### Datasets(Quickstart)
The reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2Jthe). The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
The reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J). The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, the Reddit dataset has been preprocessed and packed into a Docker image, which can be pulled with the following command.
- reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
Download `reddit.npz` and `reddit_adj.npz` into the `data` directory for further preprocessing.
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
- scipy
- redis==2.10.6
- redis-py-cluster==1.3.6
```sh
pip install -r requirements.txt
```
### How to run
#### 1. Start reddit data service
#### 1. Preprocessing and start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
pushd ./redis_setup
/bin/bash ./before_hook.sh
popd
```
#### 2. Training the GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_workers 10
sh ./cloud_run.sh
```
#### Hyperparameters
- epoch: Number of epochs (default: 10).
- use_cuda: Use GPU if `use_cuda` is set.
- graphsage_type: We support 4 aggregator types: "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm" (see the sketch below).
- sample_workers: The number of workers for multiprocess subgraph sampling.
- lr: Learning rate.
- batch_size: Batch size.
- samples_1: The maximum number of neighbors sampled in the first hop (default: 25).
- samples_2: The maximum number of neighbors sampled in the second hop (default: 10).
- hidden_size: The hidden size of the GraphSAGE models.
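All four aggregators follow PGL's send/recv message-passing interface: each node copies its feature to its out-edges, the receiver reduces the incoming messages (mean, max-pooling, etc.), and the self and neighbor representations are transformed and concatenated. The snippet below is a minimal sketch of the mean aggregator; the function `graphsage_mean_sketch` and its layer names are illustrative only, not the exact implementation in this example's `model.py`.
```python
import paddle.fluid as fluid

def copy_send(src_feat, dst_feat, edge_feat):
    # send: every source node forwards its feature "h" along its out-edges
    return src_feat["h"]

def mean_recv(feat):
    # recv: average the variable-length sequence of messages arriving at each node
    return fluid.layers.sequence_pool(feat, pool_type="average")

def graphsage_mean_sketch(gw, feature, hidden_size, name):
    # gw is a pgl.graph_wrapper.GraphWrapper holding the sampled subgraph
    msg = gw.send(copy_send, nfeat_list=[("h", feature)])
    neigh_feature = gw.recv(msg, mean_recv)
    self_feature = fluid.layers.fc(
        feature, hidden_size, act="relu", name=name + "_self")
    neigh_feature = fluid.layers.fc(
        neigh_feature, hidden_size, act="relu", name=name + "_neigh")
    output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
    return fluid.layers.l2_normalize(output, axis=1)
```
In the actual model the aggregator is stacked once per hop (i.e. `len(samples)` times), so `samples_1` and `samples_2` control both the sampled fan-out and the receptive field depth.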
#!/bin/bash
set -x
mode=${1}
source ./utils.sh
unset http_proxy https_proxy
source ./local_config
if [ ! -d ${log_dir} ]; then
mkdir ${log_dir}
fi
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $log_dir
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $log_dir/pserver.$i.log &
done
sleep 10s
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $log_dir/worker.$j.log &
done
tail -f $log_dir/worker.0.log
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from model import GraphsageModel
from utils import load_config
import reader
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to the configuration
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, optimizer='adam'):
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(base_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(base_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
# create the DistributeTranspiler configuration
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_compiled_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(loss_name=model_loss.name)
return compiled_prog
def fake_py_reader(data_iter, num):
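"""Group feed dicts from data_iter into lists of `num` (one per device);
the last incomplete group is padded by repeating its final element so
every device receives a batch."""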
def fake_iter():
queue = []
for idx, data in enumerate(data_iter()):
queue.append(data)
if len(queue) == num:
yield queue
queue = []
if len(queue) > 0:
while len(queue) < num:
queue.append(queue[-1])
yield queue
return fake_iter
def train_prog(exe, program, model, pyreader, args):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
start = time.time()
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
for epoch_idx in range(args.num_epoch):
for step, batch_feed_dict in enumerate(pyreader()):
try:
cpu_time = time.time()
batch += 1
batch_loss, batch_acc = exe.run(
program,
feed=batch_feed_dict,
fetch_list=[model.loss, model.acc])
end = time.time()
if batch % args.log_per_step == 0:
log.info(
"Batch %s Loss %s Acc %s \t Speed(per batch) %.5lf/%.5lf sec"
% (batch, np.mean(batch_loss), np.mean(batch_acc), (end - start) /batch, (end - cpu_time)))
if step % args.steps_per_save == 0:
save_path = args.save_path
if trainer_id == 0:
model_path = os.path.join(save_path, "%s" % step)
fleet.save_persistables(exe, model_path)
except Exception as e:
log.info("Pyreader train error")
log.exception(e)
def main(args):
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = GraphsageModel(args)
loss = model.forward()
train_iter = reader.get_iter(args, model.graph_wrapper, 'train')
pyreader = fake_py_reader(train_iter, num_devices)
# init fleet
init_role()
optimization(args.lr, loss, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
# worker side only: load the training samples
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
compiled_prog = build_compiled_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, model, pyreader, args)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("-c", "--config", type=str, default="./config.yaml")
args = parser.parse_args()
config = load_config(args.config)
log.info(config)
main(config)
# model config
hidden_size: 128
num_class: 41
samples: [25, 10]
graphsage_type: "graphsage_mean"
# training config
num_epoch: 10
batch_size: 128
num_sample_workers: 10
optimizer: "adam"
lr: 0.01
warm_start_from_dir: null
steps_per_save: 1000
log_per_step: 1
save_path: "./checkpoints"
log_dir: "./logs"
CPU_NUM: 1
#!/bin/bash
set -x
source ./utils.sh
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=0
python -u cluster_train.py -c config.yaml
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
......@@ -11,10 +11,22 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
graphsage model.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import pgl
import numpy as np
import paddle
import paddle.fluid.layers as L
import paddle.fluid as F
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
......@@ -128,3 +140,87 @@ def graphsage_lstm(gw, feature, hidden_size, act, name):
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
class GraphsageModel(object):
def __init__(self, args):
self.args = args
def forward(self):
args = self.args
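# Reddit node features are 602-dimensional, hence the fixed feature shape below.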
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", node_feat=[('feats', [None, 602], np.dtype('float32'))])
loss, acc = build_graph_model(
graph_wrapper,
num_class=args.num_class,
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(args.samples))
loss.persistable = True
self.graph_wrapper = graph_wrapper
self.loss = loss
self.acc = acc
return loss
......@@ -11,6 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import sys
import numpy as np
import pickle as pkl
import paddle
......@@ -147,3 +149,48 @@ def multiprocess_graph_reader(
return reader()
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label.npz is preprocessed from reddit.npz with the feats key removed.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def get_iter(args, graph_wrapper, mode):
data = load_data()
train_iter = multiprocess_graph_reader(
graph_wrapper,
samples=args.samples,
num_workers=args.num_sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
return train_iter
if __name__ == '__main__':
for e in train_iter():
print(e)
#!/bin/bash
set -x
srcdir=./src
# Data preprocessing
python ./src/preprocess.py
# Download and compile redis
export PATH=$PWD/redis-5.0.5/src:$PATH
if [ ! -f ./redis.tar.gz ]; then
curl https://codeload.github.com/antirez/redis/tar.gz/5.0.5 -o ./redis.tar.gz
fi
tar -xzf ./redis.tar.gz
cd ./redis-5.0.5/
make
cd -
# Install python deps
python -m pip install -U pip
pip install -r ./src/requirements.txt -U
# Run redis server
sh ./src/run_server.sh
# Dumping data into redis
source ./redis_graph.cfg
sh ./src/dump_data.sh $edge_path $server_list $num_nodes $node_feat_path
exit 0
# dump config
edge_path=../data/edge.txt
node_feat_path=../data/feats.npz
num_nodes=232965
server_list=./server.list
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import json
import logging
from collections import defaultdict
import tqdm
import redis
from redis._compat import b, unicode, bytes, long, basestring
from rediscluster.nodemanager import NodeManager
from rediscluster.crc import crc16
import argparse
import time
import pickle
import numpy as np
import scipy.sparse as sp
log = logging.getLogger(__name__)
root = logging.getLogger()
root.setLevel(logging.DEBUG)
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
def encode(value):
"""
Return a bytestring representation of the value.
This method is copied from Redis' connection.py:Connection.encode
"""
if isinstance(value, bytes):
return value
elif isinstance(value, (int, long)):
value = b(str(value))
elif isinstance(value, float):
value = b(repr(value))
elif not isinstance(value, basestring):
value = unicode(value)
if isinstance(value, unicode):
value = value.encode('utf-8')
return value
def crc16_hash(data):
return crc16(encode(data))
def get_redis(startup_host, startup_port):
startup_nodes = [{"host": startup_host, "port": startup_port}, ]
nodemanager = NodeManager(startup_nodes=startup_nodes)
nodemanager.initialize()
rs = {}
for node, config in nodemanager.nodes.items():
rs[node] = redis.Redis(
host=config["host"], port=config["port"], decode_responses=False)
return rs, nodemanager
def load_data(edge_path):
src, dst = [], []
with open(edge_path, "r") as f:
for i in tqdm.tqdm(f):
s, d, _ = i.split()
s = int(s)
d = int(d)
src.append(s)
dst.append(d)
dst.append(s)
src.append(d)
src = np.array(src, dtype="int64")
dst = np.array(dst, dtype="int64")
return src, dst
def build_edge_index(edge_path, num_nodes, startup_host, startup_port,
num_bucket):
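"""Write each node's adjacency into Redis.

For node i, the destination ids and edge ids are packed into one bytes
blob under key "d:i"; keys are grouped by crc16(key) % num_bucket into
Redis hashes named "part-<bucket>" and flushed with HMSET in chunks.
"""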
#src, dst = load_data(edge_path)
rs, nodemanager = get_redis(startup_host, startup_port)
dst_mp, edge_mp = defaultdict(list), defaultdict(list)
with open(edge_path) as f:
for l in tqdm.tqdm(f):
a, b, idx = l.rstrip().split('\t')
a, b, idx = int(a), int(b), int(idx)
dst_mp[a].append(b)
edge_mp[a].append(idx)
part_dst_dicts = {}
for i in tqdm.tqdm(range(num_nodes)):
#if len(edge_index.v[i]) == 0:
# continue
#v = edge_index.v[i].astype("int64").reshape([-1, 1])
#e = edge_index.eid[i].astype("int64").reshape([-1, 1])
if i not in dst_mp:
continue
v = np.array(dst_mp[i]).astype('int64').reshape([-1, 1])
e = np.array(edge_mp[i]).astype('int64').reshape([-1, 1])
o = np.hstack([v, e])
key = "d:%s" % i
part = crc16_hash(key) % num_bucket
if part not in part_dst_dicts:
part_dst_dicts[part] = {}
dst_dicts = part_dst_dicts[part]
dst_dicts["d:%s" % i] = o.tobytes()
if len(dst_dicts) > 10000:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, dst_dicts)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_dst_dicts[part] = {}
for part, dst_dicts in part_dst_dicts.items():
if len(dst_dicts) > 0:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, dst_dicts)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_dst_dicts[part] = {}
log.info("dst_dict Done")
def build_edge_id(edge_path, num_nodes, startup_host, startup_port,
num_bucket):
src, dst = load_data(edge_path)
rs, nodemanager = get_redis(startup_host, startup_port)
part_edge_dict = {}
for i in tqdm.tqdm(range(len(src))):
key = "e:%s" % i
part = crc16_hash(key) % num_bucket
if part not in part_edge_dict:
part_edge_dict[part] = {}
edge_dict = part_edge_dict[part]
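# Pack the endpoint pair into a single integer: src * num_nodes + dst.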
edge_dict["e:%s" % i] = int(src[i]) * num_nodes + int(dst[i])
if len(edge_dict) > 10000:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, edge_dict)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_edge_dict[part] = {}
for part, edge_dict in part_edge_dict.items():
if len(edge_dict) > 0:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, edge_dict)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_edge_dict[part] = {}
def build_infos(edge_path, num_nodes, startup_host, startup_port, num_bucket):
src, dst = load_data(edge_path)
rs, nodemanager = get_redis(startup_host, startup_port)
slot = nodemanager.keyslot("num_nodes")
node = nodemanager.slots[slot][0]['name']
res = rs[node].set("num_nodes", num_nodes)
slot = nodemanager.keyslot("num_edges")
node = nodemanager.slots[slot][0]['name']
rs[node].set("num_edges", len(src))
slot = nodemanager.keyslot("nf:infos")
node = nodemanager.slots[slot][0]['name']
rs[node].set("nf:infos", json.dumps([['feats', [-1, 602], 'float32'], ]))
slot = nodemanager.keyslot("ef:infos")
node = nodemanager.slots[slot][0]['name']
rs[node].set("ef:infos", json.dumps([]))
def build_node_feat(node_feat_path, num_nodes, startup_host, startup_port, num_bucket):
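"""Write node features into Redis: the feature vector of node i for key k
is stored as raw bytes at "nf:<k>:<i>" inside the hash
"part-<crc16(key) % num_bucket>"."""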
assert node_feat_path != "", "node_feat_path empty!"
feat_dict = np.load(node_feat_path)
for k in feat_dict.keys():
feat = feat_dict[k]
assert feat.shape[0] == num_nodes, "num_nodes invalid"
rs, nodemanager = get_redis(startup_host, startup_port)
part_feat_dict = {}
for k in feat_dict.keys():
feat = feat_dict[k]
for i in tqdm.tqdm(range(num_nodes)):
key = "nf:%s:%i" % (k, i)
value = feat[i].tobytes()
part = crc16_hash(key) % num_bucket
if part not in part_feat_dict:
part_feat_dict[part] = {}
part_feat = part_feat_dict[part]
part_feat[key] = value
if len(part_feat) > 100:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, part_feat)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_feat_dict[part] = {}
for part, part_feat in part_feat_dict.items():
if len(part_feat) > 0:
slot = nodemanager.keyslot("part-%s" % part)
node = nodemanager.slots[slot][0]['name']
while True:
res = rs[node].hmset("part-%s" % part, part_feat)
if res:
break
log.info("HMSET FAILED RETRY connected %s" % node)
time.sleep(1)
part_feat_dict[part] = {}
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='gen_redis_conf')
parser.add_argument('--startup_port', type=int, required=True)
parser.add_argument('--startup_host', type=str, required=True)
parser.add_argument('--edge_path', type=str, default="")
parser.add_argument('--node_feat_path', type=str, default="")
parser.add_argument('--num_nodes', type=int, default=0)
parser.add_argument('--num_bucket', type=int, default=64)
parser.add_argument(
'--mode',
type=str,
required=True,
help="choose one of the following modes (clear, edge_index, edge_id, graph_attr)"
)
args = parser.parse_args()
log.info("Mode: {}".format(args.mode))
if args.mode == 'edge_index':
build_edge_index(args.edge_path, args.num_nodes, args.startup_host,
args.startup_port, args.num_bucket)
elif args.mode == 'edge_id':
build_edge_id(args.edge_path, args.num_nodes, args.startup_host,
args.startup_port, args.num_bucket)
elif args.mode == 'graph_attr':
build_infos(args.edge_path, args.num_nodes, args.startup_host,
args.startup_port, args.num_bucket)
elif args.mode == 'node_feat':
build_node_feat(args.node_feat_path, args.num_nodes, args.startup_host,
args.startup_port, args.num_bucket)
else:
raise ValueError("%s mode not found" % args.mode)
filter(){
lines=`cat $1`
rm $1
for line in $lines; do
remote_host=`echo $line | cut -d":" -f1`
remote_port=`echo $line | cut -d":" -f2`
nc -z $remote_host $remote_port
if [[ $? == 0 ]]; then
echo $line >> $1
fi
done
}
dump_data(){
filter $server_list
python ./src/start_cluster.py --server_list $server_list --replicas 0
address=`head -n 1 $server_list`
ip=`echo $address | cut -d":" -f1`
port=`echo $address | cut -d":" -f2`
python ./src/build_graph.py --startup_host $ip \
--startup_port $port \
--mode node_feat \
--node_feat_path $feat_fn \
--num_nodes $num_nodes
# build edge index
python ./src/build_graph.py --startup_host $ip \
--startup_port $port \
--mode edge_index \
--edge_path $edge_path \
--num_nodes $num_nodes
# build edge id
#python ./src/build_graph.py --startup_host $ip \
# --startup_port $port \
# --mode edge_id \
# --edge_path $edge_path \
# --num_nodes $num_nodes
# build graph attr
python ./src/build_graph.py --startup_host $ip \
--startup_port $port \
--mode graph_attr \
--edge_path $edge_path \
--num_nodes $num_nodes
}
if [ $# -ne 4 ]; then
echo 'usage: sh dump_data.sh edge_path server_list num_nodes feat_fn'
exit
fi
num_nodes=$3
server_list=$2
edge_path=$1
feat_fn=$4
dump_data
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import sys
import socket
import argparse
import os
temp = """port %s
bind %s
daemonize yes
pidfile /var/run/redis_%s.pid
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 50000
logfile "redis.log"
appendonly yes"""
def gen_config(ports):
if len(ports) == 0:
raise ValueError("No ports")
ip = socket.gethostbyname(socket.gethostname())
print("Generate redis conf")
for port in ports:
try:
os.mkdir("%s" % port)
except OSError:
print("port %s directory already exists" % port)
with open("%s/redis.conf" % port, 'w') as f:
f.write(temp % (port, ip, port))
print("Generate Start Server Scripts")
with open("start_server.sh", "w") as f:
f.write("set -x\n")
for ind, port in enumerate(ports):
f.write("# %s %s start\n" % (ip, port))
if ind > 0:
f.write("cd ..\n")
f.write("cd %s\n" % port)
f.write("redis-server redis.conf\n")
f.write("\n")
print("Generate Stop Server Scripts")
with open("stop_server.sh", "w") as f:
f.write("set -x\n")
for ind, port in enumerate(ports):
f.write("# %s %s shutdown\n" % (ip, port))
f.write("redis-cli -h %s -p %s shutdown\n" % (ip, port))
f.write("\n")
with open("server.list", "w") as f:
for ind, port in enumerate(ports):
f.write("%s:%s\n" % (ip, port))
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='gen_redis_conf')
parser.add_argument('--ports', nargs='+', type=int, default=[])
args = parser.parse_args()
gen_config(args.ports)
import os
import sys
import numpy as np
import scipy.sparse as sp
def _load_config(fn):
ret = {}
with open(fn) as f:
for l in f:
if l.strip() == '' or l.startswith('#'):
continue
k, v = l.strip().split('=')
ret[k] = v
return ret
def _prepro(config):
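"""Convert the raw Reddit files into the inputs expected by build_graph.py:
a tab-separated edge list (src, dst, edge_id) at config['edge_path'] and a
feats-only npz at config['node_feat_path']."""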
data = np.load("../data/reddit.npz")
adj = sp.load_npz("../data/reddit_adj.npz")
adj = adj.tocoo()
src = adj.row
dst = adj.col
with open(config['edge_path'], 'w') as f:
for idx, e in enumerate(zip(src, dst)):
s, d = e
l = "{}\t{}\t{}\n".format(s, d, idx)
f.write(l)
feats = data['feats'].astype(np.float32)
np.savez(config['node_feat_path'], feats=feats)
if __name__ == '__main__':
config = _load_config('./redis_graph.cfg')
_prepro(config)
numpy
scipy
tqdm
redis==2.10.6
redis-py-cluster==1.3.6
start_server(){
ports=""
for i in {7430..7439}; do
nc -z localhost $i
if [[ $? != 0 ]]; then
ports="$ports $i"
fi
done
python ./src/gen_redis_conf.py --ports $ports
bash ./start_server.sh  # start the redis servers
}
start_server
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
def build_clusters(server_list, replicas):
servers = []
with open(server_list) as f:
for line in f:
servers.append(line.strip())
cmd = "echo yes | redis-cli --cluster create"
for server in servers:
cmd += ' %s ' % server
cmd += '--cluster-replicas %s' % replicas
print(cmd)
os.system(cmd)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='start_cluster')
parser.add_argument('--server_list', type=str, required=True)
parser.add_argument('--replicas', type=int, default=0)
args = parser.parse_args()
build_clusters(args.server_list, args.replicas)
#!/bin/bash
source ./redis_graph.cfg
url=`head -n1 $server_list`
shuf $edge_path | head -n 1000 | python ./test/test_redis_graph.py $url
#!/usr/bin/env python
# -*- coding: utf-8 -*-
########################################################################
#
# Copyright (c) 2019 Baidu.com, Inc. All Rights Reserved
#
# File: test_redis_graph.py
# Author: suweiyue(suweiyue@baidu.com)
# Date: 2019/08/19 16:28:18
#
########################################################################
"""
For each edge (i, j) read from stdin, check that j appears among the sampled predecessors of i in the Redis graph.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import sys
import numpy as np
import tqdm
from pgl.redis_graph import RedisGraph
if __name__ == '__main__':
host, port = sys.argv[1].split(':')
port = int(port)
redis_configs = [{"host": host, "port": port}, ]
graph = RedisGraph("reddit-graph", redis_configs, num_parts=64)
#nodes = np.arange(0, 100)
#for i in range(0, 100):
for l in tqdm.tqdm(sys.stdin):
l_sp = l.rstrip().split('\t')
if len(l_sp) != 2:
continue
i, j = int(l_sp[0]), int(l_sp[1])
nodes = graph.sample_predecessor(np.array([i]), 10000)
assert j in nodes
pgl==1.1.0
pyyaml
paddlepaddle==1.6.1
scipy
redis==2.10.6
redis-py-cluster==1.3.6
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label.npz is preprocessed from reddit.npz with the feats key removed.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data()
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", node_feat=[('feats', [None, 602], np.dtype('float32'))])
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=1,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Implementation of some helper functions"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import yaml
import numpy as np
from pgl.utils.logger import log
class AttrDict(dict):
"""Attr dict
"""
def __init__(self, d):
self.dict = d
def __getattr__(self, attr):
value = self.dict[attr]
if isinstance(value, dict):
return AttrDict(value)
else:
return value
def __str__(self):
return str(self.dict)
def load_config(config_file):
"""Load config file"""
with open(config_file) as f:
if hasattr(yaml, 'FullLoader'):
config = yaml.load(f, Loader=yaml.FullLoader)
else:
config = yaml.load(f)
return AttrDict(config)
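# Example usage (assuming a config.yaml with top-level "lr" and "hidden_size" keys):
#   config = load_config("config.yaml")
#   print(config.lr, config.hidden_size)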
# parse yaml file
function parse_yaml {
local prefix=$2
local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
sed -ne "s|^\($s\):|\1|" \
-e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
-e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
awk -F$fs '{
indent = length($1)/2;
vname[indent] = $2;
for (i in vname) {if (i > indent) {delete vname[i]}}
if (length($3) > 0) {
vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
}
}'
}
eval $(parse_yaml "$(dirname "${BASH_SOURCE}")"/config.yaml)
......@@ -31,7 +31,7 @@ final_fc: true
final_l2_norm: true
loss_type: "hinge"
margin: 0.3
neg_type: "random_neg"
neg_type: "batch_neg"
# infer config ------
infer_model: "./output/last"
......@@ -49,7 +49,7 @@ ernie_config:
max_position_embeddings: 513
num_attention_heads: 12
num_hidden_layers: 12
sent_type_vocab_size: 4
sent_type_vocab_size: 2
task_type_vocab_size: 3
vocab_size: 18000
use_task_id: false
......
......@@ -31,7 +31,7 @@ final_fc: true
final_l2_norm: true
loss_type: "hinge"
margin: 0.3
neg_type: "random_neg"
neg_type: "batch_neg"
# infer config ------
infer_model: "./output/last"
......@@ -49,7 +49,7 @@ ernie_config:
max_position_embeddings: 513
num_attention_heads: 12
num_hidden_layers: 12
sent_type_vocab_size: 4
sent_type_vocab_size: 2
task_type_vocab_size: 3
vocab_size: 18000
use_task_id: false
......
......@@ -24,7 +24,7 @@ from pgl.sample import edge_hash
class GraphGenerator(BaseDataGenerator):
def __init__(self, graph_wrappers, data, batch_size, samples,
num_workers, feed_name_list, use_pyreader,
phase, graph_data_path, shuffle=True, buf_size=1000):
phase, graph_data_path, shuffle=True, buf_size=1000, neg_type="batch_neg"):
super(GraphGenerator, self).__init__(
buf_size=buf_size,
......@@ -40,6 +40,7 @@ class GraphGenerator(BaseDataGenerator):
self.phase = phase
self.load_graph(graph_data_path)
self.num_layers = len(graph_wrappers)
self.neg_type = neg_type
def load_graph(self, graph_data_path):
self.graph = pgl.graph.MemmapGraph(graph_data_path)
......@@ -72,7 +73,11 @@ class GraphGenerator(BaseDataGenerator):
batch_src = np.array(batch_src, dtype="int64")
batch_dst = np.array(batch_dst, dtype="int64")
sampled_batch_neg = alias_sample(batch_dst.shape, self.alias, self.events)
if self.neg_type == "batch_neg":
neg_shape = [1]
else:
neg_shape = batch_dst.shape
sampled_batch_neg = alias_sample(neg_shape, self.alias, self.events)
if len(batch_neg) > 0:
batch_neg = np.concatenate([batch_neg, sampled_batch_neg], 0)
......@@ -80,6 +85,7 @@ class GraphGenerator(BaseDataGenerator):
batch_neg = sampled_batch_neg
if self.phase == "train":
#ignore_edges = np.concatenate([np.stack([batch_src, batch_dst], 1), np.stack([batch_dst, batch_src], 1)], 0)
ignore_edges = set()
else:
ignore_edges = set()
......@@ -99,7 +105,7 @@ class GraphGenerator(BaseDataGenerator):
feed_dict["user_index"] = np.array(sub_src_idx, dtype="int64")
feed_dict["item_index"] = np.array(sub_dst_idx, dtype="int64")
feed_dict["neg_item_index"] = np.array(sub_neg_idx, dtype="int64")
feed_dict["term_ids"] = self.term_ids[subgraphs[0].node_feat["index"]]
feed_dict["term_ids"] = self.term_ids[subgraphs[0].node_feat["index"]].astype(np.int64)
return feed_dict
def __call__(self):
......
......@@ -59,8 +59,7 @@ def run_predict(py_reader,
log_per_step=1,
args=None):
if args.input_type == "text":
id2str = np.load(os.path.join(args.graph_path, "id2str.npy"), mmap_mode="r")
id2str = io.open(os.path.join(args.graph_path, "terms.txt"), encoding=args.encoding).readlines()
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
......@@ -82,7 +81,7 @@ def run_predict(py_reader,
for ufs, _, sri in zip(batch_usr_feat, batch_ad_feat, batch_src_real_index):
if args.input_type == "text":
sri = id2str[int(sri)]
sri = id2str[int(sri)].strip("\n")
line = "{}\t{}\n".format(sri, tostr(ufs))
fout.write(line)
......
......@@ -17,6 +17,7 @@ role = os.getenv("TRAINING_ROLE", "TRAINER")
import numpy as np
from pgl.utils.logger import log
from pgl.utils.log_writer import LogWriter
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import StrategyFactory
......@@ -25,7 +26,6 @@ from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerCo
from paddle.fluid.incubate.fleet.collective import fleet as cfleet
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet as tfleet
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from tensorboardX import SummaryWriter
from paddle.fluid.transpiler.distribute_transpiler import DistributedMode
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler.distributed_strategy import TrainerRuntimeConfig
......@@ -77,7 +77,7 @@ class Learner(object):
start = time.time()
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
if trainer_id == 0:
writer = SummaryWriter(os.path.join(self.config.output_path, "train_history"))
writer = LogWriter(os.path.join(self.config.output_path, "train_history"))
for epoch_idx in range(self.config.epoch):
for idx, batch_feed_dict in enumerate(self.model.data_loader()):
......
......@@ -191,12 +191,12 @@ def all_gather(X):
for i in range(trainer_num):
copy_X = X * 1
copy_X = L.collective._broadcast(copy_X, i, True)
copy_X.stop_gradients=True
copy_X.stop_gradient=True
Xs.append(copy_X)
if len(Xs) > 1:
Xs=L.concat(Xs, 0)
Xs.stop_gradients=True
Xs.stop_gradient=True
else:
Xs = Xs[0]
return Xs
......
......@@ -104,7 +104,7 @@ class ErnieModel(object):
zero = L.fill_constant([1], dtype='int64', value=0)
input_mask = L.logical_not(L.equal(src_ids,
zero)) # assume pad id == 0
input_mask = L.cast(input_mask, 'float')
input_mask = L.cast(input_mask, 'float32')
input_mask.stop_gradient = True
return input_mask
......@@ -342,7 +342,7 @@ class ErnieGraphModel(ErnieModel):
L.range(
0, slot_seqlen, 1, dtype='int32'), [1, slot_seqlen, 1],
inplace=True) # [1, slot_seqlen, 1]
a_position_ids = L.expand(a_position_ids, [src_batch, 1, 1]) # [B, slot_seqlen * num_b, 1]
a_position_ids = L.expand(a_position_ids, [src_batch, 1, 1]) # [B, slot_seqlen, 1]
zero = L.fill_constant([1], dtype='int64', value=0)
input_mask = L.cast(L.equal(src_ids[:, :slot_seqlen], zero), "int32") # assume pad id == 0 [B, slot_seqlen, 1]
......
......@@ -455,18 +455,6 @@ def graph_encoder(enc_input,
attn_bias = build_graph_attn_bias(input_mask, n_head, enc_input.dtype, slot_seqlen)
#attn_bias = build_attn_bias(input_mask, n_head, enc_input.dtype)
# d_batch = d_shape[0]
# d_seqlen = d_shape[1]
# pad_idx = L.where(
# L.cast(L.reshape(input_mask, [d_batch, d_seqlen]), 'bool'))
# attn_bias = L.matmul(
# input_mask, input_mask, transpose_y=True) # [batch, seq, seq]
# attn_bias = (1. - attn_bias) * -10000.
# attn_bias = L.stack([attn_bias] * n_head, 1)
# if attn_bias.dtype != enc_input.dtype:
# attn_bias = L.cast(attn_bias, enc_input.dtype)
def to_2d(t_3d):
t_2d = L.gather_nd(t_3d, pad_idx)
return t_2d
......
......@@ -27,7 +27,7 @@ class ErnieSageV2(BaseNet):
src_position_ids = L.expand(src_position_ids, [src_batch, 1, 1]) # [B, slot_seqlen * num_b, 1]
zero = L.fill_constant([1], dtype='int64', value=0)
input_mask = L.cast(L.equal(src_ids, zero), "int32") # assume pad id == 0 [B, slot_seqlen, 1]
src_pad_len = L.reduce_sum(input_mask, 1) # [B, 1, 1]
src_pad_len = L.reduce_sum(input_mask, 1, keep_dim=True) # [B, 1, 1]
dst_position_ids = L.reshape(
L.range(
......@@ -81,14 +81,16 @@ class ErnieSageV2(BaseNet):
self_feature = L.fc(self_feature,
hidden_size,
act=act,
param_attr=F.ParamAttr(name=name + "_l",
param_attr=F.ParamAttr(name=name + "_l.w_0",
learning_rate=learning_rate),
bias_attr=name+"_l.b_0"
)
neigh_feature = L.fc(neigh_feature,
hidden_size,
act=act,
param_attr=F.ParamAttr(name=name + "_r",
learning_rate=learning_rate),
param_attr=F.ParamAttr(name=name + "_r.w_0",
learning_rate=learning_rate),
bias_attr=name+"_r.b_0"
)
output = L.concat([self_feature, neigh_feature], axis=1)
output = L.l2_normalize(output, axis=1)
......
......@@ -24,7 +24,6 @@ from models.message_passing import copy_send
class ErnieSageV3(BaseNet):
def __init__(self, config):
super(ErnieSageV3, self).__init__(config)
self.config.layer_type = "ernie_recv_sum"
def build_inputs(self):
inputs = super(ErnieSageV3, self).build_inputs()
......@@ -35,11 +34,10 @@ class ErnieSageV3(BaseNet):
def gnn_layer(self, gw, feature, hidden_size, act, initializer, learning_rate, name):
def ernie_recv(feat):
"""doc"""
# TODO maxlen 400
#pad_value = L.cast(L.assign(input=np.array([0], dtype=np.int32)), "int64")
num_neighbor = self.config.samples[0]
pad_value = L.zeros([1], "int64")
out, _ = L.sequence_pad(feat, pad_value=pad_value, maxlen=10)
out = L.reshape(out, [0, 400])
out, _ = L.sequence_pad(feat, pad_value=pad_value, maxlen=num_neighbor)
out = L.reshape(out, [0, self.config.max_seqlen*num_neighbor])
return out
def erniesage_v3_aggregator(gw, feature, hidden_size, act, initializer, learning_rate, name):
......@@ -73,7 +71,7 @@ class ErnieSageV3(BaseNet):
act,
initializer,
learning_rate=fc_lr,
name="%s_%s" % (self.config.layer_type, i))
name="%s_%s" % ("erniesage_v3", i))
features.append(feature)
return features
......@@ -85,17 +83,16 @@ class ErnieSageV3(BaseNet):
ernie = ErnieGraphModel(
src_ids=feat,
config=ernie_config,
slot_seqlen=self.config.max_seqlen,
name="student_")
slot_seqlen=self.config.max_seqlen)
feat = ernie.get_pooled_output()
fc_lr = self.config.lr / 0.001
feat= L.fc(feat,
self.config.hidden_size,
act="relu",
param_attr=F.ParamAttr(name=name + "_l",
learning_rate=fc_lr),
)
feat = L.l2_normalize(feat, axis=1)
# feat = L.fc(feat,
# self.config.hidden_size,
# act="relu",
# param_attr=F.ParamAttr(name=name + "_l",
# learning_rate=fc_lr),
# )
#feat = L.l2_normalize(feat, axis=1)
if self.config.final_fc:
feat = L.fc(feat,
......
......@@ -57,14 +57,16 @@ def graphsage_sum(gw, feature, hidden_size, act, initializer, learning_rate, nam
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_l", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_l.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_l.b_0"
)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_r", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_r.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_r.b_0"
)
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
......@@ -79,14 +81,16 @@ def graphsage_mean(gw, feature, hidden_size, act, initializer, learning_rate, na
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_l", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_l.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_l.b_0"
)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_r", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_r.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_r.b_0"
)
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
......@@ -101,14 +105,16 @@ def pinsage_mean(gw, feature, hidden_size, act, initializer, learning_rate, name
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_l", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_l.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_l.b_0"
)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_r", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_r.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_r.b_0"
)
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
......@@ -123,14 +129,16 @@ def pinsage_sum(gw, feature, hidden_size, act, initializer, learning_rate, name)
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_l", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_l.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_l.b_0"
)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
param_attr=fluid.ParamAttr(name=name + "_r", initializer=initializer,
param_attr=fluid.ParamAttr(name=name + "_r.w_0", initializer=initializer,
learning_rate=learning_rate),
bias_attr=name+"_r.b_0"
)
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
......
......@@ -36,7 +36,7 @@ from tokenization import FullTokenizer
def term2id(string, tokenizer, max_seqlen):
string = string.split("\t")[1]
#string = string.split("\t")[1]
tokens = tokenizer.tokenize(string)
ids = tokenizer.convert_tokens_to_ids(tokens)
ids = ids[:max_seqlen-1]
......@@ -99,19 +99,13 @@ def dump_graph(args):
np.save(os.path.join(args.outpath, "neg_samples.npy"), np.array(neg_samples))
log.info("End Build Graph")
def dump_id2str_map(args):
log.info("Dump id2str map starting...")
id2str = np.array([line.strip("\n") for line in open(os.path.join(args.outpath, "terms.txt"), "r", encoding=args.encoding)])
np.save(os.path.join(args.outpath, "id2str.npy"), id2str)
log.info("Dump id2str map done.")
def dump_node_feat(args):
log.info("Dump node feat starting...")
id2str = np.load(os.path.join(args.outpath, "id2str.npy"), mmap_mode="r")
id2str = [line.strip("\n").split("\t")[1] for line in io.open(os.path.join(args.outpath, "terms.txt"), encoding=args.encoding)]
pool = multiprocessing.Pool()
tokenizer = FullTokenizer(args.vocab_file)
term_ids = pool.map(partial(term2id, tokenizer=tokenizer, max_seqlen=args.max_seqlen), id2str)
np.save(os.path.join(args.outpath, "term_ids.npy"), np.array(term_ids))
np.save(os.path.join(args.outpath, "term_ids.npy"), np.array(term_ids, np.uint16))
log.info("Dump node feat done.")
pool.terminate()
......@@ -124,5 +118,4 @@ if __name__ == "__main__":
parser.add_argument("-o", "--outpath", type=str, default=None)
args = parser.parse_args()
dump_graph(args)
dump_id2str_map(args)
dump_node_feat(args)
......@@ -32,8 +32,9 @@ class TrainData(object):
trainer_count = int(os.getenv("PADDLE_TRAINERS_NUM", "1"))
log.info("trainer_id: %s, trainer_count: %s." % (trainer_id, trainer_count))
edges = np.load(os.path.join(graph_path, "edges.npy"), allow_pickle=True)
bidirectional_edges = np.load(os.path.join(graph_path, "edges.npy"), allow_pickle=True)
# edges is bidirectional.
edges = bidirectional_edges[0::2]
train_usr = edges[trainer_id::trainer_count, 0]
train_ad = edges[trainer_id::trainer_count, 1]
returns = {
......@@ -73,7 +74,8 @@ def main(config):
use_pyreader=config.use_pyreader,
phase="train",
graph_data_path=config.graph_path,
shuffle=True)
shuffle=True,
neg_type=config.neg_type)
log.info("build graph reader done.")
......
......@@ -23,7 +23,7 @@ You can make your customized dataset by the following format:
For example, use a GPU to train STGCN on your dataset.
```
python main.py --use_cuda --input_file dataset/input_csv --label_file dataset/output.csv --adj_mat_file dataset/W.csv --city_file dataset/city.csv
python main.py --use_cuda --input_file dataset/input.csv --label_file dataset/output.csv --adj_mat_file dataset/W.csv --city_file dataset/city.csv
```
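For reference, a minimal loading sketch under the assumption that your files follow the layout of the toy dataset added in this change (`input.csv`/`output.csv` carry a `date` column plus one column per city, `W.csv` is a header-less N x N weight matrix, and `city.csv` maps node ids to city names); pandas is assumed, since the data pipeline already manipulates pandas DataFrames:

```
import pandas as pd

x = pd.read_csv("dataset/input.csv")            # features: date + one column per city
y = pd.read_csv("dataset/output.csv")           # labels: date + one column per city
w = pd.read_csv("dataset/W.csv", header=None)   # N x N adjacency / weight matrix
cities = pd.read_csv("dataset/city.csv")        # node id -> city name mapping
```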
#### Hyperparameters
......
......@@ -167,9 +167,6 @@ def data_gen_mydata(input_file, label_file, n, n_his, n_pred, n_config):
x = x.drop(columns=['date'])
y = y.drop(columns=['date'])
x = x.drop(columns=['武汉'])
y = y.drop(columns=['武汉'])
# param
n_val, n_test = n_config
n_train = len(y) - n_val - n_test - 2
......
0,3409,2025,509,13098
2404,0,2207,3654,9485
21926,18619,0,955,1308
20160,12493,170,0,1906
611,572,1204,1066,0
num,city
0,A
1,B
2,C
3,D
4,E
date,A,B,C,D,E
2327/1/1,178,3907,2907,1170,832
2327/1/2,220,2720,2548,1370,1039
2327/1/3,222,5065,4286,2051,1582
2327/1/4,183,5291,4626,2096,1614
2327/1/5,172,3916,3538,1726,1349
2327/1/6,219,4079,4110,2044,1701
2327/1/7,220,4707,4673,2589,2177
2327/1/8,222,5306,5512,3015,2463
2327/1/9,215,5762,5802,3184,2558
2327/1/10,217,4977,4641,2659,2185
2327/1/11,186,6849,6106,3092,2310
2327/1/12,175,5953,4986,2521,1769
2327/1/13,215,5270,4983,2559,1818
2327/1/14,213,5304,5307,2516,1707
2327/1/15,205,5499,5684,2659,1633
2327/1/16,205,5811,6531,2920,1793
2327/1/17,222,6397,7745,3159,2036
2327/1/18,253,7759,9681,4011,2331
2327/1/19,859,8791,8215,4507,2480
2327/1/20,837,10348,9960,5655,3167
2327/1/21,931,12782,13621,7107,4291
2327/1/22,1048,15298,16222,8206,4730
2327/1/23,835,16287,14803,6504,3679
2327/1/24,635,4806,3970,1551,816
2327/1/25,511,1028,1023,401,205
2327/1/26,387,483,632,249,111
2327/1/27,459,457,591,209,126
2327/1/28,1073,513,707,234,176
2327/1/29,1301,651,932,276,264
2327/1/30,1502,757,1266,369,302
2327/1/31,1823,972,1286,490,487
2327/2/1,2219,1113,1594,579,548
2327/2/2,2719,1345,2172,695,703
2327/2/3,3563,1556,2517,931,823
2327/2/4,4335,1824,2837,1095,928
2327/2/5,5568,2343,3323,1244,1043
2327/2/6,6070,2917,3420,1295,1054
2327/2/7,7169,3278,3758,1516,1185
2327/2/8,8284,3616,3982,1639,1333
2327/2/9,9229,3799,4200,1726,1418
2327/2/10,10425,3876,4334,1750,1449
2327/2/11,11213,3920,4522,1818,1484
2327/2/12,11653,4106,4831,1881,1512
2327/2/13,20427,4343,5413,2537,1570
2327/2/14,24164,4666,5914,2636,1607
2327/2/15,22608,4901,5546,2812,1557
date,A,B,C,D,E
2327/1/24,70,22,0,2,0
2327/1/25,77,4,52,2,1
2327/1/26,46,29,58,23,7
2327/1/27,80,45,32,14,28
2327/1/28,892,73,59,24,34
2327/1/29,315,101,111,30,61
2327/1/30,356,125,172,50,32
2327/1/31,378,142,77,70,123
2327/2/1,576,87,153,66,61
2327/2/2,894,121,276,46,94
2327/2/3,1033,169,244,166,107
2327/2/4,1242,202,176,114,84
2327/2/5,1967,342,223,100,103
2327/2/6,1766,424,162,88,52
2327/2/7,1501,255,90,84,51
2327/2/8,1985,172,144,56,69
2327/2/9,1379,123,100,56,81
2327/2/10,1920,105,111,48,31
2327/2/11,1552,101,80,30,44
2327/2/12,1104,109,66,35,25
2327/2/13,13436,123,264,321,13
2327/2/14,2997,135,129,16,10
2327/2/15,1923,105,26,31,17
2327/2/16,1548,87,6,12,16
......@@ -124,7 +124,7 @@ def main(args):
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--n_route', type=int, default=74)
parser.add_argument('--n_route', type=int, default=5)
parser.add_argument('--n_his', type=int, default=23)
parser.add_argument('--n_pred', type=int, default=3)
parser.add_argument('--batch_size', type=int, default=10)
......
......@@ -19,10 +19,10 @@ import os
from datetime import datetime
import logging
from collections import defaultdict
from tensorboardX import SummaryWriter
import paddle.fluid as F
from pgl.utils.logger import log
from pgl.utils.log_writer import LogWriter
def multi_device(reader, dev_count):
......@@ -79,10 +79,10 @@ def train_and_evaluate(exe,
global_step = 0
timestamp = datetime.now().strftime("%Hh%Mm%Ss")
log_path = os.path.join(args.log_dir, "tensorboard_log_%s" % timestamp)
log_path = os.path.join(args.log_dir, "log_%s" % timestamp)
_create_if_not_exist(log_path)
writer = SummaryWriter(log_path)
writer = LogWriter(log_path)
best_valid_score = 0.0
for e in range(args.epoch):
......@@ -99,7 +99,7 @@ def train_and_evaluate(exe,
ret = model.metrics.parse(ret)
if global_step % args.train_log_step == 0:
writer.add_scalar(
"batch_loss", ret['loss'], global_step=global_step)
"batch_loss", ret['loss'], global_step)
log.info("epoch: %d | step: %d | loss: %.4f " %
(e, global_step, ret['loss']))
......@@ -111,7 +111,7 @@ def train_and_evaluate(exe,
for key, value in valid_ret.items():
message += "%s %.4f | " % (key, value)
writer.add_scalar(
"eval_%s" % key, value, global_step=global_step)
"eval_%s" % key, value, global_step)
log.info(message)
# testing
......@@ -120,7 +120,7 @@ def train_and_evaluate(exe,
for key, value in test_ret.items():
message += "%s %.4f | " % (key, value)
writer.add_scalar(
"test_%s" % key, value, global_step=global_step)
"test_%s" % key, value, global_step)
log.info(message)
# evaluate after one epoch
......@@ -128,7 +128,7 @@ def train_and_evaluate(exe,
message = "epoch %s valid: " % e
for key, value in valid_ret.items():
message += "%s %.4f | " % (key, value)
writer.add_scalar("eval_%s" % key, value, global_step=global_step)
writer.add_scalar("eval_%s" % key, value, global_step)
log.info(message)
# testing
......@@ -136,7 +136,7 @@ def train_and_evaluate(exe,
message = "epoch %s test: " % e
for key, value in test_ret.items():
message += "%s %.4f | " % (key, value)
writer.add_scalar("test_%s" % key, value, global_step=global_step)
writer.add_scalar("test_%s" % key, value, global_step)
log.info(message)
message = "epoch %s best %s result | " % (e, args.eval_metrics)
......
......@@ -18,7 +18,7 @@ import numpy as np
import sys
import os
import paddle.fluid as F
from tensorboardX import SummaryWriter
from pgl.utils.log_writer import LogWriter
from ogb.linkproppred import Evaluator
from ogb.linkproppred import LinkPropPredDataset
......@@ -115,7 +115,7 @@ def train_and_evaluate(exe,
log_path = os.path.join(output_path, "log")
_create_if_not_exist(log_path)
writer = SummaryWriter(log_path)
writer = LogWriter(log_path)
best_model = 0
for e in range(epoch):
......@@ -134,7 +134,7 @@ def train_and_evaluate(exe,
if global_step % train_log_step == 0:
for key, value in ret.items():
writer.add_scalar(
'train_' + key, value, global_step=global_step)
'train_' + key, value, global_step)
global_step += 1
if global_step % eval_step == 0:
......@@ -149,7 +149,7 @@ def train_and_evaluate(exe,
sys.stderr.write(json.dumps(eval_ret, indent=4) + "\n")
for key, value in eval_ret.items():
writer.add_scalar(key, value, global_step=global_step)
writer.add_scalar(key, value, global_step)
if eval_ret["valid_hits@100"] > best_model:
F.io.save_persistables(
......@@ -170,7 +170,7 @@ def train_and_evaluate(exe,
sys.stderr.write(json.dumps(eval_ret, indent=4) + "\n")
for key, value in eval_ret.items():
writer.add_scalar(key, value, global_step=global_step)
writer.add_scalar(key, value, global_step)
if eval_ret["valid_hits@100"] > best_model:
F.io.save_persistables(exe,
......
# Graph Node Prediction for Open Graph Benchmark (OGB) Arxiv dataset
[The Open Graph Benchmark (OGB)](https://ogb.stanford.edu/) is a collection of benchmark datasets, data loaders, and evaluators for graph machine learning. Here we tackle the graph node prediction task on the ogbn-arxiv dataset with PGL.
### Requirements
paddlepaddle >= 1.7.1
pgl 1.0.2
ogb 1.1.1
### How to Run
```
CUDA_VISIBLE_DEVICES=0 python train.py \
--use_cuda 1 \
--num_workers 4 \
--output_path ./output/model_1 \
--batch_size 1024 \
--test_batch_size 512 \
--epoch 100 \
--learning_rate 0.001 \
--full_batch 0 \
--model gaan \
--drop_rate 0.5 \
--samples 8 8 8 \
--test_samples 20 20 20 \
--hidden_size 256
```
or
```
sh run.sh
```
The best record will be saved in ./output/model_1/best.txt.
### Hyperparameters
- use_cuda: whether to use gpu or not
- num_workers: the number of sampling workers
- output_path: path to save the model
- batch_size: batch size
- epoch: number of training epochs
- learning_rate: learning rate
- full_batch: run the full graph in a single batch (batch_size takes no effect when enabled)
- model: model to run; gaan, sage, gcn and eta are currently available
- drop_rate: dropout rate of the feature layers
- samples: the number of neighbors sampled at each GNN layer
- hidden_size: the hidden size
### Performance
We train our models for 100 epochs and report the **acc** on the test dataset.
|dataset|mean|std|#experiments|
|-|-|-|-|
|ogbn-arxiv|0.7197|0.0024|16|
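The reported **acc** is computed with the official OGB evaluator. A minimal scoring sketch, assuming `y_true` and `y_pred` are numpy arrays of shape `[num_nodes, 1]` as in the evaluation code below:

```
from ogb.nodeproppred import Evaluator

evaluator = Evaluator(name="ogbn-arxiv")
acc = evaluator.eval({"y_true": y_true, "y_pred": y_pred})["acc"]
```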
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""finetune args"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import os
import time
import argparse
from utils.args import ArgumentGroup
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
model_g = ArgumentGroup(parser, "model", "model configuration and paths.")
model_g.add_arg("init_checkpoint", str, None, "Init checkpoint to resume training from.")
model_g.add_arg("init_pretraining_params", str, None,
"Init pre-training params which preforms fine-tuning from. If the "
"arg 'init_checkpoint' has been set, this argument wouldn't be valid.")
train_g = ArgumentGroup(parser, "training", "training options.")
train_g.add_arg("epoch", int, 3, "Number of epoches for fine-tuning.")
train_g.add_arg("learning_rate", float, 5e-5, "Learning rate used to train with warmup.")
run_type_g = ArgumentGroup(parser, "run_type", "running type options.")
run_type_g.add_arg("use_cuda", bool, True, "If set, use GPU for training.")
run_type_g.add_arg("num_workers", int, 4, "use multiprocess to generate graph")
run_type_g.add_arg("output_path", str, None, "path to save model")
run_type_g.add_arg("model", str, None, "model to run")
run_type_g.add_arg("hidden_size", int, 256, "model hidden-size")
run_type_g.add_arg("drop_rate", float, 0.5, "Dropout rate")
run_type_g.add_arg("batch_size", int, 1024, "batch_size")
run_type_g.add_arg("full_batch", bool, False, "use static graph wrapper, if full_batch is true, batch_size will take no effect.")
run_type_g.add_arg("samples", type=int, nargs='+', default=[30, 30], help="sample nums of k-hop.")
run_type_g.add_arg("test_batch_size", int, 512, help="sample nums of k-hop of test phase.")
run_type_g.add_arg("test_samples", type=int, nargs='+', default=[30, 30], help="sample nums of k-hop.")
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Base DataLoader
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import os
import sys
import six
from io import open
from collections import namedtuple
import numpy as np
import tqdm
import paddle
from pgl.utils import mp_reader
import collections
import time
import pgl
if six.PY3:
import io
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf-8')
sys.stderr = io.TextIOWrapper(sys.stderr.buffer, encoding='utf-8')
def batch_iter(data, perm, batch_size, fid, num_workers):
"""node_batch_iter
"""
size = len(data)
start = 0
cc = 0
while start < size:
index = perm[start:start + batch_size]
start += batch_size
cc += 1
if cc % num_workers != fid:
continue
yield data[index]
def scan_batch_iter(data, batch_size, fid, num_workers):
"""node_batch_iter
"""
batch = []
cc = 0
for line_example in data.scan():
cc += 1
if cc % num_workers != fid:
continue
batch.append(line_example)
if len(batch) == batch_size:
yield batch
batch = []
if len(batch) > 0:
yield batch
class BaseDataGenerator(object):
"""Base Data Geneartor"""
def __init__(self, buf_size, batch_size, num_workers, shuffle=True):
self.num_workers = num_workers
self.batch_size = batch_size
self.line_examples = []
self.buf_size = buf_size
self.shuffle = shuffle
def batch_fn(self, batch_examples):
""" batch_fn batch producer"""
raise NotImplementedError("No defined Batch Fn")
def batch_iter(self, fid, perm):
""" batch iterator"""
if self.shuffle:
for batch in batch_iter(self, perm, self.batch_size, fid,
self.num_workers):
yield batch
else:
for batch in scan_batch_iter(self, self.batch_size, fid,
self.num_workers):
yield batch
def __len__(self):
return len(self.line_examples)
def __getitem__(self, idx):
if isinstance(idx, collections.Iterable):
return [self[bidx] for bidx in idx]
else:
return self.line_examples[idx]
def generator(self):
"""batch dict generator"""
def worker(filter_id, perm):
""" multiprocess worker"""
def func_run():
""" func_run """
pid = os.getpid()
np.random.seed(pid + int(time.time()))
for batch_examples in self.batch_iter(filter_id, perm):
batch_dict = self.batch_fn(batch_examples)
yield batch_dict
return func_run
# consume a seed
np.random.rand()
if self.shuffle:
perm = np.arange(0, len(self))
np.random.shuffle(perm)
else:
perm = None
if self.num_workers == 1:
r = paddle.reader.buffered(worker(0, perm), self.buf_size)
else:
worker_pool = [
worker(wid, perm) for wid in range(self.num_workers)
]
worker = mp_reader.multiprocess_reader(
worker_pool, use_pipe=True, queue_size=1000)
r = paddle.reader.buffered(worker, self.buf_size)
for batch in r():
yield batch
def scan(self):
for line_example in self.line_examples:
yield line_example
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
from dataloader.base_dataloader import BaseDataGenerator
from utils.to_undirected import to_undirected
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
from pgl.contrib.ogb.nodeproppred.dataset_pgl import PglNodePropPredDataset
from pgl.sample import graph_saint_random_walk_sample
from ogb.nodeproppred import Evaluator
import tqdm
from collections import namedtuple
import pgl
import numpy as np
import copy
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
return nodes
def k_hop_sampler(graph, samples, batch_nodes):
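    """Sample a k-hop neighborhood subgraph around `batch_nodes`.

    For hop i, at most samples[i] predecessors are drawn per frontier node;
    the visited nodes and edges form the subgraph, and the positions of the
    original batch nodes inside that subgraph are returned alongside it.
    """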
# for batch_train_samples, batch_train_labels in batch_info:
start_nodes = copy.deepcopy(batch_nodes)
nodes = start_nodes
edges = []
for max_deg in samples:
pred_nodes = graph.sample_predecessor(start_nodes, max_degree=max_deg)
for dst_node, src_nodes in zip(start_nodes, pred_nodes):
for src_node in src_nodes:
edges.append((src_node, dst_node))
last_nodes = nodes
nodes = [nodes, pred_nodes]
nodes = flat_node_and_edge(nodes)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(
nodes=nodes, edges=edges, with_node_feat=True, with_edge_feat=True)
sub_node_index = subgraph.reindex_from_parrent_nodes(batch_nodes)
return subgraph, sub_node_index
def graph_saint_randomwalk_sampler(graph, batch_nodes, max_depth=3):
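    """Sample a subgraph with GraphSAINT-style random walks from `batch_nodes`
    and return it together with the batch nodes' indices inside the subgraph.
    """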
subgraph = graph_saint_random_walk_sample(graph, batch_nodes, max_depth)
sub_node_index = subgraph.reindex_from_parrent_nodes(batch_nodes)
return subgraph, sub_node_index
class ArxivDataGenerator(BaseDataGenerator):
def __init__(self,
graph_wrapper=None,
buf_size=1000,
batch_size=128,
num_workers=1,
samples=[30, 30],
shuffle=True,
phase="train"):
super(ArxivDataGenerator, self).__init__(
buf_size=buf_size,
num_workers=num_workers,
batch_size=batch_size,
shuffle=shuffle)
self.samples = samples
self.d_name = "ogbn-arxiv"
self.graph_wrapper = graph_wrapper
dataset = PglNodePropPredDataset(name=self.d_name)
splitted_idx = dataset.get_idx_split()
self.phase = phase
graph, label = dataset[0]
graph = to_undirected(graph)
self.graph = graph
self.num_nodes = graph.num_nodes
if self.phase == 'train':
nodes_idx = splitted_idx["train"]
labels = label[nodes_idx]
elif self.phase == "valid":
nodes_idx = splitted_idx["valid"]
labels = label[nodes_idx]
elif self.phase == "test":
nodes_idx = splitted_idx["test"]
labels = label[nodes_idx]
self.nodes_idx = nodes_idx
self.labels = labels
self.sample_based_line_example(nodes_idx, labels)
def sample_based_line_example(self, nodes_idx, labels):
self.line_examples = []
Example = namedtuple('Example', ["node", "label"])
for node, label in zip(nodes_idx, labels):
self.line_examples.append(Example(node=node, label=label))
print("Phase", self.phase)
print("Len Examples", len(self.line_examples))
def batch_fn(self, batch_ex):
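        """Turn a batch of (node, label) examples into a feed dict: sample a
        k-hop subgraph around the batch nodes and attach their subgraph
        indices and labels.
        """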
batch_nodes = []
cc = 0
batch_node_id = []
batch_labels = []
for ex in batch_ex:
batch_nodes.append(ex.node)
batch_labels.append(ex.label)
_graph_wrapper = copy.copy(self.graph_wrapper)
#if self.phase == "train":
# subgraph, sub_node_index = graph_saint_randomwalk_sampler(self.graph, batch_nodes)
#else:
subgraph, sub_node_index = k_hop_sampler(self.graph, self.samples,
batch_nodes)
feed_dict = _graph_wrapper.to_feed(subgraph)
feed_dict["batch_nodes"] = sub_node_index
feed_dict["labels"] = np.array(batch_labels, dtype="int64")
return feed_dict
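# A minimal usage sketch (hypothetical variable names; assumes a GraphWrapper
# built in the training script and the sample sizes from run.sh):
#
#   train_ds = ArxivDataGenerator(graph_wrapper=graph_wrapper,
#                                 batch_size=1024,
#                                 samples=[8, 8, 8],
#                                 num_workers=4,
#                                 phase="train")
#   for feed_dict in train_ds.generator():
#       ...  # feed into the training executor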
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""init"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""train and evaluate"""
import tqdm
import json
import numpy as np
import sys
import os
import paddle.fluid as F
from tensorboardX import SummaryWriter
from ogb.nodeproppred import Evaluator
from ogb.nodeproppred import NodePropPredDataset
def multi_device(reader, dev_count):
"""multi device"""
if dev_count == 1:
for batch in reader:
yield batch
else:
batches = []
for batch in reader:
batches.append(batch)
if len(batches) == dev_count:
yield batches
batches = []
class OgbEvaluator(object):
def __init__(self):
d_name = "ogbn-arxiv"
dataset = NodePropPredDataset(name=d_name)
graph, label = dataset[0]
self.num_nodes = graph["num_nodes"]
self.ogb_evaluator = Evaluator(name="ogbn-arxiv")
def eval(self, scores, labels, phase):
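        """Compute accuracy with the official ogbn-arxiv Evaluator; `scores`
        are per-class logits and `labels` the ground-truth node labels.
        """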
pred = (np.argmax(scores, axis=1)).reshape([-1, 1])
ret = {}
ret['%s_acc' % (phase)] = self.ogb_evaluator.eval({
'y_true': labels,
'y_pred': pred,
})['acc']
return ret
def evaluate(model, valid_exe, valid_ds, valid_prog, dev_count, evaluator,
phase, full_batch):
"""evaluate """
cc = 0
scores = []
labels = []
if full_batch:
valid_iter = _full_batch_wapper(valid_ds)
else:
valid_iter = valid_ds.generator
for feed_dict in tqdm.tqdm(
multi_device(valid_iter(), dev_count), desc='evaluating'):
if dev_count > 1:
output = valid_exe.run(feed=feed_dict,
fetch_list=[model.logits, model.labels])
else:
output = valid_exe.run(valid_prog,
feed=feed_dict,
fetch_list=[model.logits, model.labels])
scores.append(output[0])
labels.append(output[1])
scores = np.vstack(scores)
labels = np.vstack(labels)
ret = evaluator.eval(scores, labels, phase)
return ret
def _create_if_not_exist(path):
basedir = os.path.dirname(path)
if not os.path.exists(basedir):
os.makedirs(basedir)
def _full_batch_wapper(ds):
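    """Wrap a dataset into a generator yielding one feed dict that covers all
    of its nodes and labels (used when full_batch mode is enabled).
    """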
feed_dict = {}
feed_dict["batch_nodes"] = np.array(ds.nodes_idx, dtype="int64")
feed_dict["labels"] = np.array(ds.labels, dtype="int64")
def r():
yield feed_dict
return r
def train_and_evaluate(exe,
train_exe,
valid_exe,
train_ds,
valid_ds,
test_ds,
train_prog,
valid_prog,
full_batch,
model,
metric,
epoch=20,
dev_count=1,
train_log_step=5,
eval_step=10000,
evaluator=None,
output_path=None):
"""train and evaluate"""
global_step = 0
log_path = os.path.join(output_path, "log")
_create_if_not_exist(log_path)
writer = SummaryWriter(log_path)
best_model = 0
if full_batch:
train_iter = _full_batch_wapper(train_ds)
else:
train_iter = train_ds.generator
for e in range(epoch):
ret_sum_loss = 0
per_step = 0
scores = []
labels = []
for feed_dict in tqdm.tqdm(
multi_device(train_iter(), dev_count), desc='Epoch %s' % e):
if dev_count > 1:
ret = train_exe.run(feed=feed_dict, fetch_list=metric.vars)
ret = [[np.mean(v)] for v in ret]
else:
ret = train_exe.run(
train_prog,
feed=feed_dict,
fetch_list=[model.loss, model.logits, model.labels]
#fetch_list=metric.vars
)
scores.append(ret[1])
labels.append(ret[2])
ret = [ret[0]]
ret = metric.parse(ret)
if global_step % train_log_step == 0:
for key, value in ret.items():
writer.add_scalar(
'train_' + key, value, global_step=global_step)
ret_sum_loss += ret['loss']
per_step += 1
global_step += 1
if global_step % eval_step == 0:
eval_ret = evaluate(model, exe, valid_ds, valid_prog, 1,
evaluator, "valid", full_batch)
test_eval_ret = evaluate(model, exe, test_ds, valid_prog, 1,
evaluator, "test", full_batch)
eval_ret.update(test_eval_ret)
sys.stderr.write(json.dumps(eval_ret, indent=4) + "\n")
for key, value in eval_ret.items():
writer.add_scalar(key, value, global_step=global_step)
if eval_ret["valid_acc"] > best_model:
F.io.save_persistables(
exe,
os.path.join(output_path, "checkpoint"), train_prog)
eval_ret["epoch"] = e
#eval_ret["step"] = global_step
with open(os.path.join(output_path, "best.txt"), "w") as f:
f.write(json.dumps(eval_ret, indent=2) + '\n')
best_model = eval_ret["valid_acc"]
scores = np.vstack(scores)
labels = np.vstack(labels)
ret = evaluator.eval(scores, labels, "train")
sys.stderr.write(json.dumps(ret, indent=4) + "\n")
#print(json.dumps(ret, indent=4) + "\n")
# Epoch End
sys.stderr.write("epoch:{}, average loss {}\n".format(e, ret_sum_loss /
per_step))
eval_ret = evaluate(model, exe, valid_ds, valid_prog, 1, evaluator,
"valid", full_batch)
test_eval_ret = evaluate(model, exe, test_ds, valid_prog, 1, evaluator,
"test", full_batch)
eval_ret.update(test_eval_ret)
sys.stderr.write(json.dumps(eval_ret, indent=4) + "\n")
for key, value in eval_ret.items():
writer.add_scalar(key, value, global_step=global_step)
if eval_ret["valid_acc"] > best_model:
F.io.save_persistables(exe,
os.path.join(output_path, "checkpoint"),
train_prog)
#eval_ret["step"] = global_step
eval_ret["epoch"] = e
with open(os.path.join(output_path, "best.txt"), "w") as f:
f.write(json.dumps(eval_ret, indent=2) + '\n')
best_model = eval_ret["valid_acc"]
writer.close()
device=0
model='gaan'
lr=0.001
drop=0.5
CUDA_VISIBLE_DEVICES=${device} \
python -u train.py \
--use_cuda 1 \
--num_workers 4 \
--output_path ./output/model \
--batch_size 1024 \
--test_batch_size 512 \
--epoch 100 \
--learning_rate ${lr} \
--full_batch 0 \
--model ${model} \
--drop_rate ${drop} \
--samples 8 8 8 \
--test_samples 20 20 20 \
--hidden_size 256
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""utils"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
from __future__ import absolute_import
import six
import os
import sys
import argparse
import logging
import paddle.fluid as fluid
log = logging.getLogger(__name__)
def prepare_logger(logger, debug=False, save_to_file=None):
"""doc"""
formatter = logging.Formatter(
fmt='[%(levelname)s] %(asctime)s [%(filename)12s:%(lineno)5d]:\t%(message)s'
)
#console_hdl = logging.StreamHandler()
#console_hdl.setFormatter(formatter)
#logger.addHandler(console_hdl)
if save_to_file is not None and not os.path.exists(save_to_file):
file_hdl = logging.FileHandler(save_to_file)
file_hdl.setFormatter(formatter)
logger.addHandler(file_hdl)
logger.setLevel(logging.DEBUG)
logger.propagate = False
def str2bool(v):
"""doc"""
# because argparse does not support to parse "true, False" as python
# boolean directly
return v.lower() in ("true", "t", "1")
class ArgumentGroup(object):
"""doc"""
def __init__(self, parser, title, des):
self._group = parser.add_argument_group(title=title, description=des)
def add_arg(self,
name,
type,
default,
help,
positional_arg=False,
**kwargs):
"""doc"""
prefix = "" if positional_arg else "--"
type = str2bool if type == bool else type
self._group.add_argument(
prefix + name,
default=default,
type=type,
help=help + ' Default: %(default)s.',
**kwargs)
def print_arguments(args):
"""doc"""
log.info('----------- Configuration Arguments -----------')
for arg, value in sorted(six.iteritems(vars(args))):
log.info('%s: %s' % (arg, value))
log.info('------------------------------------------------')
def check_cuda(use_cuda, err= \
"\nYou can not set use_cuda=True in the model because you are using paddlepaddle-cpu.\n \
Please: 1. Install paddlepaddle-gpu to run your models on GPU or 2. Set use_cuda=False to run models on CPU.\n"
):
"""doc"""
try:
        if use_cuda and not fluid.is_compiled_with_cuda():
log.error(err)
sys.exit(1)
except Exception as e:
pass
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Arguments for configuration."""
from __future__ import absolute_import
from __future__ import unicode_literals
import paddle.fluid as fluid
import pgl
import numpy as np
def to_undirected(graph):
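    """Return a copy of `graph` with every edge also inserted in the reverse
    direction; edge features are stacked for the reversed edges and node
    features are carried over unchanged.
    """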
inv_edges = np.zeros(graph.edges.shape)
inv_edges[:, 0] = graph.edges[:, 1]
inv_edges[:, 1] = graph.edges[:, 0]
edges = np.vstack((graph.edges, inv_edges))
g = pgl.graph.Graph(num_nodes=graph.num_nodes, edges=edges)
for k, v in graph._edge_feat.items():
g._edge_feat[k] = np.vstack((v, v))
for k, v in graph._node_feat.items():
g._node_feat[k] = v
return g
......@@ -21,3 +21,4 @@ from pgl import data_loader
from pgl import heter_graph
from pgl import heter_graph_wrapper
from pgl import contrib
from pgl import message_passing
......@@ -4,3 +4,5 @@ cython >= 0.25.2
#paddlepaddle
redis-py-cluster
visualdl >= 2.0.0b ; python_version >= "3"