Commit 32ecdce0 authored by Y Yelrose

pgl version 1.0.0

Parent aeddfcd9
......@@ -23,7 +23,7 @@ repos:
sha: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
hooks:
- id: check-added-large-files
args: [--maxkb=1024]
args: [--maxkb=4096]
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
......
......@@ -2,7 +2,6 @@ sphinx==2.1.0
mistune
sphinx_rtd_theme
numpy >= 1.16.4
networkx==2.3
cython >= 0.25.2
paddlepaddle==1.5.1
paddlepaddle
pgl
......@@ -8,3 +8,4 @@ API Reference
pgl.layers
pgl.data_loader
pgl.utils.paddle_helper
pgl.utils.mp_reader
pgl.utils.mp\_reader module: MultiProcessing reader helper function for Paddle.
================================================================================
.. automodule:: pgl.utils.mp_reader
:members:
:undoc-members:
:show-inheritance:
......@@ -32,18 +32,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275 | 0.0253s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~83% |
| Pubmed | ~78% |
| Citeseer | ~70% |
### How to run
......
......@@ -27,18 +27,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~79% |
| Citeseer | ~71% |
### How to run
......
......@@ -12,7 +12,7 @@ The reddit dataset should be downloaded from the following links and placed in d
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### How to run
......@@ -22,6 +22,14 @@ To train a GraphSAGE model on Reddit Dataset, you can just run
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
If you want to train a GraphSAGE model with multiple GPUs, you can just run
```
CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2
```
#### Hyperparameters
- epoch: Number of epochs (default: 10)
......
......@@ -5,7 +5,7 @@
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- paddlepaddle>=1.6
- pgl
## How to run
......
......@@ -19,11 +19,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
| Dataset | Accuracy | epoch time | examples/gat | Improvement |
| --- | --- | --- | --- | --- |
| Cora | ~83% | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0124s |0.0253s | 2.04x |
### How to run
......
......@@ -10,7 +10,7 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
......@@ -18,11 +18,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| Dataset | Accuracy | epoch time | examples/gcn | Improvement |
| --- | --- | --- | --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
| Cora | ~81% | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0045s |0.0177s | 3.93x |
......
......@@ -8,8 +8,7 @@ To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
paddlepaddle >= 1.4 (Faster performance on 1.5)
networkx
paddlepaddle >= 1.6
cython
We can simply install pgl by pip.
......
......@@ -35,8 +35,8 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
return fluid.layers.sequence_pool(msg, "sum")
```
DGL applies kernel-fusion optimization with scatter-gather to common aggregators such as sum and max. For **complex user-defined functions**, however, its degree-bucketing algorithm executes each degree bucket serially and cannot take full advantage of the GPU. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which fully utilizes GPU parallelism. Even without scatter-gather optimization, PGL still delivers excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
DGL applies kernel-fusion optimization with scatter-gather to common aggregators such as sum and max. For **complex user-defined functions**, however, its degree-bucketing algorithm executes each degree bucket serially and cannot take full advantage of the GPU. In contrast, operations on PGL's LodTensor-based messages are performed in parallel, which fully utilizes GPU parallelism. In our experiments, PGL reaches up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still delivers excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
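To make this concrete, below is a minimal sketch of the kind of user-defined reducer this design allows, written only with Paddle `sequence_ops` over the LodTensor message; the reducer name and the sum-plus-max combination are illustrative, not part of PGL's built-in API.
```python
import paddle.fluid as fluid

def sum_max_reducer(msg):
    """Illustrative user-defined reducer: combine sum- and max-pooled
    neighbor messages. Both sequence_pool calls run over the whole
    LodTensor message in parallel, so no degree bucketing is needed."""
    summed = fluid.layers.sequence_pool(msg, "sum")
    maxed = fluid.layers.sequence_pool(msg, "max")
    return summed + maxed
```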
## Performance
......@@ -50,11 +50,3 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which aggregates neighbor features with an LSTM (ignoring the order of received messages), the optimized message passing in DGL is forced to degenerate into the degree-bucketing scheme and runs much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL reaches up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
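As a rough illustration of such an order-sensitive aggregator, the sketch below builds an LSTM reducer from Paddle sequence ops over the LodTensor message. It is a hypothetical example for exposition, not the GraphSAGE-LSTM implementation benchmarked above.
```python
import paddle.fluid as fluid

def lstm_reducer(msg, hidden_size=128):
    """Hypothetical LSTM aggregator: every destination node's received
    messages form one variable-length sequence in the LodTensor."""
    # dynamic_lstm expects an input width of 4 * hidden_size
    proj = fluid.layers.fc(input=msg, size=hidden_size * 4, act=None)
    hidden, _ = fluid.layers.dynamic_lstm(input=proj, size=hidden_size * 4)
    # use the last hidden state of each sequence as the aggregated feature
    return fluid.layers.sequence_last_step(hidden)
```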
......@@ -95,7 +95,7 @@ After defining the GCN layer, we can construct a deeper GCN model with two GCN l
```python
output = gcn_layer(gw, gw.node_feat['feature'],
hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=1,
output = gcn_layer(gw, output, hidden_size=2,
name='gcn_layer_2', activation=None)
```
......
# PGL Examples for DGI
[Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain an embedding for each node. Then we freeze the embeddings and train a node classifier.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For example, use GPU to run DGI pretraining and then train the classifier on the cora dataset.
```
python dgi.py --dataset cora --use_cuda
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if this flag is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
DGI Pretrain
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load dataset"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def save_param(dirname, var_name_list):
"""save_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
pos_gw = pgl.graph_wrapper.GraphWrapper(
name="pos_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
neg_gw = pgl.graph_wrapper.GraphWrapper(
name="neg_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
positive_feat = pgl.layers.gcn(pos_gw,
pos_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=pos_gw.node_feat['norm'],
name="gcn_layer_1")
negative_feat = pgl.layers.gcn(neg_gw,
neg_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=neg_gw.node_feat['norm'],
name="gcn_layer_1")
summary_feat = fluid.layers.sigmoid(
fluid.layers.reduce_mean(
positive_feat, [0], keep_dim=True))
summary_feat = fluid.layers.fc(summary_feat,
hidden_size,
bias_attr=False,
name="discriminator")
pos_logits = fluid.layers.matmul(
positive_feat, summary_feat, transpose_y=True)
neg_logits = fluid.layers.matmul(
negative_feat, summary_feat, transpose_y=True)
pos_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=pos_logits,
label=fluid.layers.ones(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
neg_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=neg_logits,
label=fluid.layers.zeros(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
loss = fluid.layers.reduce_mean(pos_loss) + fluid.layers.reduce_mean(
neg_loss)
adam = fluid.optimizer.Adam(learning_rate=1e-3)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
best_loss = 1e9
dur = []
for epoch in range(args.epoch):
feed_dict = pos_gw.to_feed(dataset.graph)
node_feat = dataset.graph.node_feat["words"].copy()
perm = np.arange(0, dataset.graph.num_nodes)
np.random.shuffle(perm)
dataset.graph.node_feat["words"] = dataset.graph.node_feat["words"][
perm]
feed_dict.update(neg_gw.to_feed(dataset.graph))
dataset.graph.node_feat["words"] = node_feat
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
if train_loss[0] < best_loss:
best_loss = train_loss[0]
save_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss[0])
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='DGI pretrain')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument(
"--epoch", type=int, default=200, help="pretrain epochs")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Train
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output.stop_gradient = True
output = fluid.layers.fc(output,
dataset.num_classes,
act=None,
name="classifier")
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=1e-2)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
load_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for distributed deepwalk
[Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the distributed deepwalk algorithm and match the metrics reported in the paper.
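As a toy illustration of the underlying idea (truncated random walks whose node sequences are fed to skip-gram), here is a minimal sketch assuming an adjacency-list graph; it is not the distributed PGL pipeline implemented below.
```python
import random

def random_walk(adj_list, start, walk_len=40):
    """Generate one truncated random walk starting from `start`."""
    walk = [start]
    for _ in range(walk_len - 1):
        neighbors = adj_list[walk[-1]]
        if not neighbors:
            break
        walk.append(random.choice(neighbors))
    return walk
```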
## Datasets
The dataset used here is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
For example, train deepwalk in distributed mode on the BlogCatalog dataset.
```sh
# train deepwalk in distributed mode.
sh cloud_run.sh
# multiclass task example
python3 multi_class.py --use_cuda --ckpt_path ./model_path/4029 --epoch 1000
```
## Hyperparameters
- dataset: The dataset to use, "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epochs.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|distributed deepwalk|multi-label classification|MacroF1|0.233|0.211
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
source ./local_config
unset http_proxy https_proxy
# build train_data
trainer_num=`echo $PADDLE_PORT | awk -F',' '{print NF}'`
rm -rf train_data && mkdir -p train_data
cd train_data
if [[ $build_train_data == True ]];then
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/trainer_num/CPU_NUM+1))
else
for i in `seq 1 $trainer_num`;do
touch $i
done
fi
cd -
# mkdir workspace
if [ -d ${BASE} ]; then
rm -rf ${BASE}
fi
mkdir ${BASE}
# start ps
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $BASE
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $BASE/pserver.$i.log &
done
sleep 5s
# start trainers
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $BASE/worker.$j.log &
done
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
from multiprocessing import Process
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from pgl import data_loader
from reader import DeepwalkReader
from model import DeepwalkModel
from utils import get_file_list
from utils import build_graph
from utils import build_fake_graph
from utils import build_gen_func
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
# create the DistributeTranspiler config
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_complied_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(
loss_name=model_loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
if trainer_id == 0 or args.is_distributed:
model_save_dir = args.save_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def test(args):
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
start = time.time()
num = 10
for idx, _ in enumerate(gen_func()):
if idx % num == num - 1:
log.info("%s" % (1.0 * (time.time() - start) / num))
start = time.time()
def walk(args):
graph = build_graph(args.num_nodes, args.edge_path)
num_sample_workers = args.num_sample_workers
if args.train_files is None or args.train_files == "None":
log.info("Walking from graph...")
train_files = [None for _ in range(num_sample_workers)]
else:
log.info("Walking from train_data...")
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
def walk_to_file(walk_gen, filename, max_num):
with open(filename, "w") as outf:
num = 0
for walks in walk_gen:
for walk in walks:
outf.write("%s\n" % "\t".join([str(i) for i in walk]))
num += 1
if num % 1000 == 0:
log.info("Total: %s, %s walkpath is saved. " %
(max_num, num))
if num == max_num:
return
m_args = [(DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=None,
train_files=train_files[i]).walk_generator(),
"%s/%s" % (args.walkpath_files, i),
args.epoch * args.num_nodes // args.num_sample_workers)
for i in range(num_sample_workers)]
ps = []
for i in range(num_sample_workers):
p = Process(target=walk_to_file, args=m_args[i])
p.start()
ps.append(p)
for i in range(num_sample_workers):
ps[i].join()
def train(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
args.is_sparse, args.is_distributed, 1.)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(1. * args.num_nodes * args.epoch /
args.batch_size / num_devices / worker_num)
log.info("Train step: %s" % train_steps)
if args.optimizer == "sgd":
args.lr *= args.batch_size * args.walk_len * args.win_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
#just the worker, load the sample
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
elif args.dataset == "ArXiv":
graph = data_loader.ArXivDataset().graph
else:
raise ValueError(args.dataset + " dataset doesn't exist")
log.info("Load built-in %s dataset done." % args.dataset)
elif args.walkpath_files is None or args.walkpath_files == "None":
graph = build_graph(args.num_nodes, args.edge_path)
log.info("Load graph from '%s' done." % args.edge_path)
else:
graph = build_fake_graph(args.num_nodes)
log.info("Load fake graph done.")
# bind gen
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
compiled_prog = build_complied_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument(
"--hidden_size",
type=int,
default=64,
help="Hidden size of the embedding.")
parser.add_argument(
"--lr", type=float, default=0.025, help="Learning rate.")
parser.add_argument(
"--neg_num", type=int, default=5, help="Number of negative samples.")
parser.add_argument(
"--epoch", type=int, default=1, help="Number of training epoch.")
parser.add_argument(
"--batch_size",
type=int,
default=128,
help="Numbert of walk paths in a batch.")
parser.add_argument(
"--walk_len", type=int, default=40, help="Length of a walk path.")
parser.add_argument(
"--win_size", type=int, default=5, help="Window size in skip-gram.")
parser.add_argument(
"--save_path",
type=str,
default="model_path",
help="Output path for saving model.")
parser.add_argument(
"--num_sample_workers",
type=int,
default=1,
help="Number of sampling workers.")
parser.add_argument(
"--steps_per_save",
type=int,
default=3000,
help="Steps for model saveing.")
parser.add_argument(
"--num_nodes",
type=int,
default=10000,
help="Number of nodes in graph.")
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--train_files", type=str, default=None)
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--is_distributed", type=str2bool, default=False)
parser.add_argument("--is_sparse", type=str2bool, default=True)
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--mode",
type=str,
required=False,
choices=['train', 'walk'],
default="train")
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="sgd")
args = parser.parse_args()
log.info(args)
if args.mode == "train":
train(args)
elif args.mode == "walk":
walk(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from model import DeepwalkModel
from utils import build_graph
from utils import build_gen_func
def optimization(base_lr, loss, train_steps, optimizer='adam'):
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
# don't use lazy mode on GPU
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def get_parallel_exe(program, loss):
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = 1  # 2 for fp32, 4 for fp16
exec_strategy.use_experimental_executor = True
exec_strategy.num_iteration_per_drop_scope = 1  # important: drop scope every iteration
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step == train_steps or
step % args.steps_per_save == 0) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def main(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
num_devices = len(F.cuda_places())
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
False, False, 1.)
pyreader = model.pyreader
loss = model.forward()
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
if args.warm_start_from_dir is not None:
F.io.load_params(exe, args.warm_start_from_dir, train_prog)
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=1
if [[ $build_train_data == True ]];then
train_files="./train_data"
else
train_files="None"
fi
if [[ $pre_walk == True ]]; then
walkpath_files="./walk_path"
if [[ $TRAINING_ROLE == "PSERVER" ]];then
while [[ ! -d train_data ]];do
sleep 60
echo "Waiting for train_data ..."
done
rm -rf $walkpath_files && mkdir -p $walkpath_files
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --mode walk --walkpath_files $walkpath_files --epoch $epoch \
--walk_len $walk_len --batch_size $batch_size --train_files $train_files --dataset "BlogCatalog"
touch build_graph_done
fi
while [[ ! -f build_graph_done ]];do
sleep 60
echo "Waiting for walk_path ..."
done
else
walkpath_files="None"
fi
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --optimizer $optimizer --walkpath_files $walkpath_files --epoch $epoch \
--is_distributed $distributed_embedding --lr $learning_rate --neg_num $neg_num --walk_len $walk_len --win_size $win_size --is_sparse $is_sparse --hidden_size $dim \
--batch_size $batch_size --steps_per_save $steps_per_save --train_files $train_files --dataset "BlogCatalog"
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
export BASE="./local_dir"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Deepwalk model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class DeepwalkModel(object):
def __init__(self,
num_nodes,
hidden_size=16,
neg_num=5,
is_sparse=False,
is_distributed=False,
embedding_lr=1.0):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, neg_num + 1, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.neg_num = neg_num
self.embed_init = F.initializer.Uniform(
low=-1. / math.sqrt(hidden_size), high=1. / math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.is_distributed = is_distributed
self.hidden_size = hidden_size
self.loss = None
self.embedding_lr = embedding_lr
# split the embedding so each part holds at most ~2**31 bytes of
# float32 parameters (num_nodes * part_size * 4 bytes)
max_hidden_size = int(math.pow(2, 31) / 4 / num_nodes)
self.num_part = int(math.ceil(1. * hidden_size / max_hidden_size))
def forward(self):
src, dsts = L.read_file(self.pyreader)
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dsts = L.reshape(dsts, [-1, 1])
if self.num_part is not None and self.num_part != 1 and not self.is_distributed:
src_embed = split_embedding(
src,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
dsts_embed = split_embedding(
dsts,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
else:
src_embed = L.embedding(
src, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
dsts_embed = L.embedding(
dsts, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
if self.is_sparse:
# reshape back
src_embed = L.reshape(src_embed, [-1, 1, self.hidden_size])
dsts_embed = L.reshape(dsts_embed,
[-1, self.neg_num + 1, self.hidden_size])
logits = L.matmul(
src_embed, dsts_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader uses Python multiprocessing to read data from the given
readers and then merges all the data through a multiprocessing.Queue or
multiprocessing.Pipe. The number of processes equals the number of input
readers; each process calls one reader.
multiprocessing.Queue requires read/write access to /dev/shm, which some
platforms do not support.
You need to create the readers first; they should be independent of each
other so that every process can work independently.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader2 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
now = time.time()
sample = deserialize_data(buff)
out = time.time() - now
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
else:
raise ValueError(name + " dataset doesn't exist")
return dataset
def node_classify_model(graph,
num_labels,
hidden_size=16,
name='node_classify_task'):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='weight'))
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
#labels = (np.random.rand(512, 39) > 0.95).astype(np.float32)
def batch_nodes_generator(shuffle=shuffle):
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
assert topk_list is not None or threshold is not None, "at least one of topk_list and threshold must not be None"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
threshold = args.threshold
dataset = load(args.dataset)
if args.batch_size is None:
batch_size = len(dataset.train_index)
else:
batch_size = args.batch_size
train_steps = (len(dataset.train_index) // batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels, train_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels, test_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=batch_size,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph, dataset.test_index, batch_size=batch_size, epoch=1))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
train_pyreader.start()
while 1:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_loss, train_probs, train_labels, train_topk
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro", threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro", threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_loss, test_probs, test_labels, test_topk
],
return_numpy=True)
test_probs_vals.append(
test_probs_val), test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--threshold", type=float, default=0.3)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/baseline_node2vec/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# deepwalk config
num_nodes=10312 # max node_id + 1
num_sample_workers=2
epoch=100
optimizer=sgd # sgd or adam
learning_rate=0.5
neg_num=5
walk_len=40
win_size=5
dim=128
batch_size=8
steps_per_save=5000
is_sparse=False
distributed_embedding=False # only use when num_nodes > 100,000,000; slower than normal embedding
build_train_data=True
pre_walk=False
CPU_NUM=16
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
from pgl.utils import mp_reader
class DeepwalkReader(object):
def __init__(self,
graph,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
walkpath_files: if is not None, read walk path from walkpath_files
"""
self.graph = graph
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
#walk = [hash_map[x] for x in line.strip('\n\t').split('\t')]
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
def node_generator():
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
try:
src_list, pos_list = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src_list.append(s[:max_len]), pos_list.append(p[:max_len])
src = [s for x in src_list for s in x]
pos = [s for x in pos_list for s in x]
src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos,
[-1, 1, 1])
neg_sample_size = [len(pos), self.neg_num, 1]
if src.shape[0] == 0:
continue
if self.neg_sample_type == "average":
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
elif self.neg_sample_type == "inbatch":
pass
dst = np.concatenate([pos, negs], 1)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
yield src[:max_len], dst[:max_len]
except Exception as e:
log.exception(e)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import numpy as np
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from reader import DeepwalkReader
import mp_reader
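# Helper utilities for the deepwalk example: get_file_list walks a file or
# directory, build_graph constructs a pgl.graph.Graph (with optional edge
# weights and an alias table for weighted walks), and build_gen_func shards
# walk/train files across num_sample_workers DeepwalkReader instances that are
# merged through mp_reader.multiprocess_reader.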
def get_file_list(path):
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path):
filelist = []
if os.path.isfile(edge_path):
filelist = [edge_path]
elif os.path.isdir(edge_path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(edge_path) for f in filenames
]
else:
raise ValueError(edge_path + " not supported")
edges, edge_weight = [], []
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
edges.append([slots[1], slots[0]])
if len(slots) > 2:
edge_weight.extend([float(slots[2]), float(slots[2])])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edges should be smaller then num_nodes!"
edge_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight)
graph = Graph(num_nodes, edges, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
del edges, edge_feat
log.info("Build graph index done")
if "weight" in graph.edge_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info("Build graph alias sample table done")
return graph
def build_fake_graph(num_nodes):
class FakeGraph():
pass
graph = FakeGraph()
graph.num_nodes = num_nodes
return graph
def build_gen_func(args, graph):
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None or args.walkpath_files == "None":
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None or args.train_files == "None":
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def test_gen_speed(gen_func):
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
# Distributed GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of accuracy as the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
For high scalability, we use Redis as a distributed graph storage solution and train GraphSAGE against the Redis server.
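Concretely, the training reader never loads the whole graph into local memory: each worker connects to the Redis graph service, samples a multi-hop neighborhood around every mini-batch of training nodes, and converts it into an in-memory subgraph for one training step. A condensed sketch of that access pattern, adapted from `reader.py` in this example (the host, port and node ids below are placeholders for illustration):

```python
from pgl import redis_graph

# placeholder connection settings; the docker image in this example serves port 7430
redis_configs = [{"host": "127.0.0.1", "port": 7430}]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)

start_nodes = [0, 1, 2]            # a mini-batch of training nodes (placeholder ids)
nodes, eids = list(start_nodes), []
for max_deg in [25, 10]:           # one entry per hop (samples_1, samples_2)
    pred, pred_eid = graph.sample_predecessor(
        start_nodes, max_degree=max_deg, return_eids=True)
    # flatten the sampled neighbors/edges and expand the frontier with new nodes only
    new_nodes = {n for ns in pred for n in ns} - set(nodes)
    eids = list(set(eids) | {e for es in pred_eid for e in es})
    start_nodes = list(new_nodes)
    nodes += start_nodes
    if not start_nodes:
        break
```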
### Datasets (Quickstart)
The reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J). The details of the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, the Reddit dataset has been preprocessed and packed into a docker image, which can be pulled with the following command.
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
- scipy
- redis==2.10.6
- redis-py-cluster==1.3.6
```
### How to run
#### 1. Start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
```
#### 2. Train the GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_workers 10
```
#### Hyperparameters
- epoch: Number of training epochs (default: 10).
- use_cuda: Train on GPU if this flag is set.
- graphsage_type: We support 4 aggregator types, including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- sample_workers: The number of worker processes for multiprocessing subgraph sampling.
- lr: Learning rate.
- batch_size: Batch size.
- samples_1: The max neighbors for the first hop neighbor sampling. (default: 25)
- samples_2: The max neighbors for the second hop neighbor sampling. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
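For instance, a run that switches to the max-pooling aggregator and enlarges the hidden size could look like this (any of the flags above can be combined in the same way):

```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_maxpool \
    --sample_workers 10 --lr 0.01 --batch_size 128 --hidden_size 256
```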
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
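# GraphSAGE aggregators expressed with PGL's message passing: copy_send copies
# each source node's feature "h" onto its out-edges, and gw.recv reduces the
# incoming messages per destination node (mean / sum / max / LSTM). Every
# aggregator below concatenates the transformed self feature with the
# aggregated neighbor feature and L2-normalizes the result.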
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def mean_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import socket
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
from pgl import redis_graph
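# Each sampling worker connects to the Redis graph service, samples a
# len(samples)-hop neighborhood around a mini-batch of training nodes with
# sample_predecessor, builds an in-memory subgraph, and yields the feed dict
# produced by graph_wrapper.to_feed together with the node labels and indices.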
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
redis_configs = [{
"host": socket.gethostbyname(socket.gethostname()),
"port": 7430
}, ]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
eid2edges = {}
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
for _dst, _srcs, _eids in zip(start_nodes, pred, pred_eid):
for _src, _eid in zip(_srcs, _eids):
eid2edges[_eid] = (_src, _dst)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(
nodes=nodes, eid=eids, edges=[eid2edges[e] for e in eids])
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
def multiprocess_graph_reader(graph_wrapper,
samples,
node_index,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
last = time.time()
for data in rd():
this = time.time()
feed_dict = data
now = time.time()
last = now
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)],
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
return reader()
scipy
redis==2.10.6
redis-py-cluster==1.3.6
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
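# Training entry point for the Redis-backed GraphSAGE example: the 602-dim
# "feats" node feature lives on the GraphWrapper, subgraphs and labels arrive
# from reader.multiprocess_graph_reader, and the training program is cloned
# (for_test=True) for validation and test evaluation.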
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label is preprocess from reddit.npz without feats key.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data()
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=[('feats', [None, 602], np.dtype('float32'))])
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=1,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
......@@ -26,24 +26,25 @@ def gat_layer(graph_wrapper, node_feature, hidden_size):
return output
```
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275 | 0.0253s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~83% |
| Pubmed | ~78% |
| Citeseer | ~70% |
### How to run
......
......@@ -68,7 +68,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -111,7 +111,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -121,7 +121,7 @@ def main(args):
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -132,7 +132,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
......@@ -26,18 +26,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~79% |
| Citeseer | ~71% |
### How to run
......
......@@ -70,7 +70,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -113,7 +113,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -123,7 +123,7 @@ def main(args):
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -134,7 +134,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
# PGL Examples for GES
[Graph Embedding with Side Information](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representation learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the GES algorithm.
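The core idea, mirrored by `GESModel` in this example's `model.py`, is that a node is represented by the mean of its id embedding and the embeddings of its side-information ids (for BlogCatalog, the group id), trained with a skip-gram objective and negative sampling over random walks. A minimal numpy sketch of the aggregation (illustrative only; the helper names and sizes below are ours, not part of the example code):

```python
import numpy as np

hidden_size = 128
num_embedding = 10351  # node ids and side-information ids share one embedding table

np.random.seed(0)
emb = np.random.randn(num_embedding, hidden_size).astype("float32")

def ges_repr(node_id, side_info_ids):
    """GES representation of a node: mean of its id embedding and its side-info embeddings."""
    ids = [node_id] + list(side_info_ids)
    return emb[ids].mean(axis=0)

def affinity(src_vec, dst_vec):
    """Skip-gram style score between a source node and a (positive or negative) context node."""
    return 1.0 / (1.0 + np.exp(-np.dot(src_vec, dst_vec)))
```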
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## How to run
For example, to train GES on the BlogCatalog dataset:
```sh
# train GES on GPU
sh gpu_run.sh
```
## Hyperparameters
- dataset: The built-in dataset "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epochs.
#!/bin/bash
export FLAGS_sync_nccl_allreduce=1
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_fraction_of_gpu_memory_to_use=1
export NCCL_DEBUG=INFO
export NCCL_IB_GID_INDEX=3
export GLOG_v=1
export GLOG_logtostderr=1
num_nodes=10312
num_embedding=10351
num_sample_workers=20
# build train_data
rm -rf train_data && mkdir -p train_data
cd train_data
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/num_sample_workers+1))
cd -
python3 gpu_train.py --output_path ./output --epoch 100 --walk_len 40 --win_size 5 --neg_num 5 --batch_size 128 --hidden_size 128 \
--num_nodes $num_nodes --num_embedding $num_embedding --num_sample_workers $num_sample_workers --steps_per_save 2000 --dataset "BlogCatalog"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" gpu_train
"""
import argparse
import time
import os
import glob
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from pgl import data_loader
import mp_reader
from reader import GESReader
from model import GESModel
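# GPU training entry point for GES: load (or build) the graph and node side
# information, create a GESModel whose py_reader is fed by one or more
# GESReader walk generators (merged with mp_reader when num_sample_workers > 1),
# and train the skip-gram style loss with a ParallelExecutor, saving parameters
# every steps_per_save steps.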
def get_file_list(path):
"""get_file_list
"""
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path, output_path, undigraph=True):
""" build_graph
"""
edge_file = os.path.join(output_path, "edge.npy")
edge_weight_file = os.path.join(output_path, "edge_weight.npy")
alias_file = os.path.join(output_path, "alias.npy")
events_file = os.path.join(output_path, "events.npy")
if os.path.isfile(edge_file):
edges = np.load(edge_file)
edge_feat = dict()
if os.path.isfile(edge_weight_file):
log.info("Loading weight from cache")
edge_feat["weight"] = np.load(edge_weight_file, allow_pickle=True)
node_feat = dict()
if os.path.isfile(alias_file):
log.info("Loading alias from cache")
node_feat["alias"] = np.load(alias_file, allow_pickle=True)
if os.path.isfile(events_file):
log.info("Loading events from cache")
node_feat["events"] = np.load(events_file, allow_pickle=True)
else:
filelist = get_file_list(edge_path)
edges, edge_weight = [], []
log.info("Reading edge files")
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
if len(slots) > 2:
edge_weight.append(slots[2])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edges should be smaller then num_nodes!"
log.info("Read edge files done.")
edge_feat = dict()
node_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight, dtype="float32")
if undigraph is True:
edges = np.concatenate([edges, edges[:, [1, 0]]], 0)
if "weight" in edge_feat:
edge_feat["weight"] = np.concatenate(
[edge_feat["weight"], edge_feat["weight"]],
0).astype("float64")
graph = Graph(num_nodes, edges, node_feat, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
log.info("Build graph index done")
if "weight" in graph.edge_feat and "alias" not in graph.node_feat and "events" not in graph.node_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info(
"Build graph alias sample table done, and saving alias & evnets cache"
)
np.save(alias_file, graph.node_feat["alias"])
np.save(events_file, graph.node_feat["events"])
return graph
def optimization(base_lr, loss, train_steps, optimizer='adam'):
""" optimization
"""
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
# don't use GPU's lazy mode
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def build_gen_func(args, graph, node_feat):
""" build_gen_func
"""
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None:
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None:
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
GESReader(
graph,
node_feat,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def get_parallel_exe(program, loss):
""" get_parallel_exe
"""
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = 1  # 2 for fp32, 4 for fp16
exec_strategy.use_experimental_executor = True
exec_strategy.num_iteration_per_drop_scope = 10  # important for performance
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
""" train
"""
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step % args.steps_per_save == 0 or
step == train_steps) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def test_gen_speed(gen_func):
""" test_gen_speed
"""
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
def main(args):
""" main
"""
import logging
log.setLevel(logging.DEBUG)
log.info("start")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
else:
raise ValueError(args.dataset + " dataset doesn't exist")
log.info("Load built-in BlogCatalog dataset done.")
node_feat = np.expand_dims(graph.node_feat["group_id"].argmax(-1),
-1) + graph.num_nodes
args.num_nodes = graph.num_nodes
args.num_embedding = graph.num_nodes + graph.node_feat[
"group_id"].shape[-1]
else:
graph = build_graph(args.num_nodes, args.edge_path, args.output_path)
node_feat = np.load(args.node_feat_npy)
model = GESModel(args.num_embedding, node_feat.shape[1] + 1,
args.hidden_size, args.neg_num, False, 2)
pyreader = model.pyreader
loss = model.forward()
num_devices = len(F.cuda_places())
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
log.info("Train steps: %s" % train_steps)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
gen_func = build_gen_func(args, graph, node_feat)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--num_embedding", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--node_feat_npy", type=str, default="./feat.npy")
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
GES model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
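# split_embedding slices a (dict_size, hidden_size) embedding table into
# num_part column chunks, each stored as a separate parameter, and concatenates
# the per-chunk lookups back to the full hidden size. GESModel averages the id
# embedding with the side-information embeddings (mean over the feature axis)
# before a skip-gram loss with neg_num negative samples; EGESModel replaces the
# plain mean with a per-node softmax weighting ("alpha").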
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class GESModel(object):
""" GESModel
"""
def __init__(self,
num_nodes,
num_featuers,
hidden_size=16,
neg_num=5,
is_sparse=False,
num_part=1):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, num_featuers, 1],
[-1, neg_num + 1, num_featuers, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.num_featuers = num_featuers
self.neg_num = neg_num
self.embed_init = F.initializer.TruncatedNormal(scale=1.0 /
math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.num_part = num_part
self.hidden_size = hidden_size
self.loss = None
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
src_embed = L.reduce_mean(src_embed, 2)
dst_embed = L.reduce_mean(dst_embed, 2)
logits = L.matmul(
src_embed, dst_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
class EGESModel(GESModel):
""" EGESModel
"""
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
src_id = L.slice(src, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, 1, 1, 1])
dst_id = L.slice(dst, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, self.neg_num + 1, 1, 1])
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
# [b, 1, f, h]
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
# [b, n+1, f, h]
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
# [b, 1, 1, f]
src_weight = L.softmax(
L.embedding(
src_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, n+1, 1, f]
dst_weight = L.softmax(
L.embedding(
dst_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, 1, h]
src_sum = L.squeeze(L.matmul(src_weight, src_embed), axes=[2])
# [b, n+1, h]
dst_sum = L.squeeze(L.matmul(dst_weight, dst_embed), axes=[2])
logits = L.matmul(
src_sum, dst_sum, transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
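# Samples produced by the readers are serialized with pyarrow before being sent
# through a multiprocessing Queue or Pipe; pyarrow handles numpy-heavy feed
# dicts efficiently, and the custom (de)serializers above let plain Python
# objects ride along via their type and __dict__.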
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader uses Python multiprocessing to read data from the given
readers and then merges all samples through a multiprocessing.Queue or
multiprocessing.Pipe. One process is started per input reader, and each
process calls one reader.
Note that multiprocessing.Queue requires read/write access to /dev/shm, which
some platforms do not support.
You need to create the readers first; these readers should be independent of
each other so that each process can work independently.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader2 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
now = time.time()
sample = deserialize_data(buff)
out = time.time() - now
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
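# GESReader yields (src_feat, dst_feat) batches for GESModel: skip-gram pairs
# are generated from random walks (read from walkpath_files or sampled from the
# graph), negatives are drawn according to neg_sample_type, and every node id
# is concatenated with its side-information ids so the model can average their
# embeddings.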
class GESReader(object):
""" GESReader
"""
def __init__(self,
graph,
node_feat,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
walkpath_files: if not None, read walk paths from walkpath_files instead of sampling walks from the graph
"""
self.graph = graph
self.node_feat = node_feat
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
""" walk_from_files
"""
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
""" walk_from_graph
"""
def node_generator():
""" node_generator
"""
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
""" walk_generator
"""
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
src, pos = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src.extend(s), pos.extend(p)
src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos, [-1, 1, 1])
if src.shape[0] == 0:
continue
neg_sample_size = [len(pos), self.neg_num, 1]
if self.neg_sample_type == "average":
negs = self.graph.sample_nodes(neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
dst = np.concatenate([pos, negs], 1)
src_feat = np.concatenate([src, self.node_feat[src[:, :, 0]]], -1)
dst_feat = np.concatenate([dst, self.node_feat[dst[:, :, 0]]], -1)
src_feat, dst_feat = np.expand_dims(src_feat, -1), np.expand_dims(
dst_feat, -1)
yield src_feat[:max_len], dst_feat[:max_len]
......@@ -12,17 +12,23 @@ The reddit dataset should be downloaded from the following links and placed in d
### Dependencies
- sklearn
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### How to run
To train a GraphSAGE model on Reddit Dataset, you can just run
```
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
If you want to train a GraphSAGE model with multiple GPUs, you can just run
```
CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2
```
#### Hyperparameters
- epoch: Number of epochs default (10)
......
......@@ -17,12 +17,15 @@ import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
import train
import time
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
......@@ -33,6 +36,8 @@ def node_batch_iter(nodes, node_label, batch_size):
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
......@@ -42,13 +47,21 @@ def traverse(item):
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph, samples):
def worker(batch_info, graph, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
......@@ -65,11 +78,14 @@ def worker(batch_info, graph, samples):
if len(start_nodes) == 0:
break
feed_dict = {}
feed_dict["nodes"] = [int(n) for n in nodes]
feed_dict["eids"] = [int(e) for e in eids]
feed_dict["node_label"] = [int(n) for n in batch_train_labels]
feed_dict["node_index"] = [int(n) for n in batch_train_samples]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
......@@ -82,26 +98,28 @@ def multiprocess_graph_reader(graph,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
last = time.time()
for data in rd():
nodes = data["nodes"]
eids = data["eids"]
batch_train_labels = data["node_label"]
batch_train_samples = data["node_index"]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
this = time.time()
feed_dict = data
now = time.time()
last = now
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
......@@ -110,9 +128,9 @@ def multiprocess_graph_reader(graph,
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)], graph,
samples))
multi_process_sample = paddle.reader.multiprocess_reader(
reader_pool, use_pipe=False)
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
......@@ -121,7 +139,10 @@ def multiprocess_graph_reader(graph,
def graph_reader(graph, graph_wrapper, samples, node_index, batch_size,
node_label):
"""graph_reader"""
def reader():
"""reader"""
for batch_train_samples, batch_train_labels in node_batch_iter(
node_index, node_label, batch_size=batch_size):
start_nodes = batch_train_samples
......
......@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
......@@ -34,8 +35,9 @@ def load_data(normalize=True, symmetry=True):
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
......@@ -64,7 +66,7 @@ def load_data(normalize=True, symmetry=True):
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"index": np.arange(
0, len(feature), dtype="int32")})
0, len(feature), dtype="int64")})
return {
"graph": graph,
......@@ -82,7 +84,7 @@ def load_data(normalize=True, symmetry=True):
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int32", append_batch_size=False)
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
......@@ -198,7 +200,9 @@ def main(args):
hide_batch_size=False)
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", place, node_feat=data['graph'].node_feat_info())
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import sys
import traceback
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
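# Multi-GPU variant of the Reddit GraphSAGE trainer: when num_trainer > 1,
# to_multidevice groups consecutive batches so that each device receives its
# own feed dict per step, and run_epoch averages the per-device loss and
# accuracy.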
def load_data(normalize=True, symmetry=True):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row
dst = adj.col
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"feat": feature.astype("float32")})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
"""build_graph_model"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
feature = graph_wrapper.node_feat["feat"]
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def to_multidevice(batch_iter, num_trainer):
"""to_multidevice"""
batch_dict = []
for batch in batch_iter():
batch_dict.append(batch)
if len(batch_dict) == num_trainer:
yield batch_dict
batch_dict = []
if len(batch_dict) > 0:
log.warning("The batch (%s) can't fill all device (%s)"
"which will be discarded." %
(len(batch_dict), num_trainer))
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100,
num_trainer=1):
"""run_epoch"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
if num_trainer > 1:
batch_iter = to_multidevice(batch_iter, num_trainer)
else:
batch_iter = batch_iter()
for batch_feed_dict in batch_iter:
batch += 1
if num_trainer > 1:
batch_loss, batch_acc = exe.run(
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
batch_loss = np.mean(batch_loss)
batch_acc = np.mean(batch_acc)
else:
batch_loss, batch_acc = exe.run(
program,
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
if num_trainer > 1:
num_samples = sum(
[len(batch["node_index"]) for batch in batch_feed_dict])
else:
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""main"""
data = load_data(args.normalize, args.symmetry)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
if args.num_trainer > 1:
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
build_strategy.enable_sequential_execution = True
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=train_program,
build_strategy=build_strategy,
loss_name=model_loss.name)
else:
train_exe = exe
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=train_exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
num_trainer=args.num_trainer,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--num_trainer", type=int, default=1)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Multi-GPU settings
"""
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def fixed_offset(data, num_nodes, scale):
"""Test
"""
len_data = len(data)
len_per_part = int(len_data / scale)
offset = np.arange(0, scale, dtype="int64")
offset = offset * num_nodes
offset = np.repeat(offset, len_per_part)
if len(data.shape) > 1:
data += offset.reshape([-1, 1])
else:
data += offset
def load_data(normalize=True, symmetry=True, scale=1):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row.reshape([-1, 1])
dst = adj.col.reshape([-1, 1])
edges = np.hstack([src, dst])
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
if scale > 1:
num_nodes = feature.shape[0]
feature = np.tile(feature, [scale, 1])
train_label = np.tile(train_label, [scale])
val_label = np.tile(val_label, [scale])
test_label = np.tile(test_label, [scale])
edges = np.tile(edges, [scale, 1])
fixed_offset(edges, num_nodes, scale)
train_index = np.tile(train_index, [scale])
fixed_offset(train_index, num_nodes, scale)
val_index = np.tile(val_index, [scale])
fixed_offset(val_index, num_nodes, scale)
test_index = np.tile(test_index, [scale])
fixed_offset(test_index, num_nodes, scale)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=edges,
node_feat={
"index": np.arange(
0, len(feature), dtype="int64"),
"feature": feature
})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"feature": feature,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
"""Test"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s % i")
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s % i")
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
"""Test"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""Test """
data = load_data(args.normalize, args.symmetry, args.scale)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
feature=graph_wrapper.node_feat["feature"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
parser.add_argument("--scale", type=int, default=1)
args = parser.parse_args()
log.info(args)
main(args)
# PGL Examples for LINE
[LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable for a variety of networks, including directed, undirected, and binary- or weighted-edge graphs. Based on PGL, we reproduce the LINE algorithm and reach the same level of performance as reported in the paper.
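For reference, the two objectives can be sketched roughly as follows (notation follows the paper: u_i is the embedding of node v_i, u'_i its context embedding, w_ij the edge weight, and K the number of negative samples):

```latex
% First-order proximity: probability of an observed edge (v_i, v_j)
p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top} \vec{u}_j)}, \qquad
O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j)

% Second-order proximity, trained with negative sampling over K negative nodes
% v_n drawn from P_n(v) \propto d_v^{3/4}
% (the same 0.75 power appears in this example's negative-sampling distribution):
\log \sigma\!\left({\vec{u}'_j}^{\top} \vec{u}_i\right)
  + \sum_{n=1}^{K} \mathbb{E}_{v_n \sim P_n(v)} \left[ \log \sigma\!\left(-{\vec{u}'_n}^{\top} \vec{u}_i\right) \right]
```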
## Datasets
The [Flickr network](http://socialnetworks.mpi-sws.org/data-imc2007.html) is a social network that contains 1,715,256 nodes and 22,613,981 edges.
You can download the data from [here](http://socialnetworks.mpi-sws.org/data-imc2007.html).
Flickr network contains four files:
* flickr-groupmemberships.txt.gz
* flickr-groups.txt.gz
* flickr-links.txt.gz
* flickr-users.txt.gz
After downloading the data, uncompress it into, say, **./data/flickr/**. Note that the current directory is the root directory of the LINE model.
Then run the command below to preprocess the data.
```sh
python data_process.py
```
It will produce three files in the **./data/flickr/** directory (a small sample is shown below):
* nodes.txt
* edges.txt
* nodes_label.txt
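The produced files look roughly as follows (the IDs below are purely illustrative; see `data_process.py` in this commit for the exact writer logic):

```text
# nodes.txt: one remapped node ID per line
1
2

# edges.txt: tab-separated (src, dst) pairs of remapped node IDs
1	3
2	5

# nodes_label.txt: "node_id,label1 label2 ...", where a node may carry several of the 5 labels
1,2
2,1 4
```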
## Dependencies
- paddlepaddle>=1.6
- pgl
## How to run
For example, to train LINE on the Flickr dataset with GPU:
```sh
# multiclass task example
python line.py --use_cuda --order first_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/
python multi_class.py --ckpt_path ./checkpoints/model/model_eopch_20 --percent 0.5
```
## Hyperparameters
- use_cuda: Use GPU to train the model if this flag is set.
- order: Train LINE with first-order or second-order proximity.
- percent: The percentage of nodes used as training data.
## Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
Flickr|LINE with first_order|multi-label classification|MacroF1|0.626|0.627
Flickr|LINE with first_order|multi-label classification|MicroF1|0.637|0.639
Flickr|LINE with second_order|multi-label classification|MacroF1|0.615|0.621
Flickr|LINE with second_order|multi-label classification|MicroF1|0.630|0.635
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the Dataset for LINE model.
"""
import os
import io
import sys
import numpy as np
from pgl import graph
from pgl.utils.logger import log
class FlickrDataset(object):
"""Flickr dataset implementation
Args:
name: The name of the dataset.
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to add self-loop edges.
train_percentage: The percentage of nodes to be trained in multi class task.
Attributes:
graph: The :code:`Graph` data object.
num_groups: Number of classes.
train_index: The index for nodes in the training set.
test_index: The index for nodes in the test set.
"""
def __init__(self,
data_path,
symmetry_edges=False,
self_loop=False,
train_percentage=0.5):
self.path = data_path
# self.name = name
self.num_groups = 5
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self.train_percentage = train_percentage
self._load_data()
def _load_data(self):
edge_path = os.path.join(self.path, 'edges.txt')
node_path = os.path.join(self.path, 'nodes.txt')
nodes_label_path = os.path.join(self.path, 'nodes_label.txt')
all_edges = []
edges_weight = []
with io.open(node_path) as inf:
num_nodes = len(inf.readlines())
node_feature = np.zeros((num_nodes, self.num_groups))
with io.open(nodes_label_path) as inf:
for line in inf:
# group_id means the label of the node
node_id, group_id = line.strip('\n').split(',')
node_id = int(node_id) - 1
labels = group_id.split(' ')
for i in labels:
node_feature[node_id][int(i) - 1] = 1
node_degree_list = [1 for _ in range(num_nodes)]
with io.open(edge_path) as inf:
for line in inf:
items = line.strip().split('\t')
if len(items) == 2:
u, v = int(items[0]), int(items[1])
weight = 1 # binary weight, default set to 1
else:
u, v, weight = int(items[0]), int(items[1]), float(items[2])
u, v = u - 1, v - 1
all_edges.append((u, v))
edges_weight.append(weight)
if self.symmetry_edges:
all_edges.append((v, u))
edges_weight.append(weight)
# sum the weights of the same node as the outdegree
node_degree_list[u] += weight
if self.self_loop:
for i in range(num_nodes):
all_edges.append((i, i))
edges_weight.append(1.)
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=num_nodes,
edges=all_edges,
node_feat={"group_id": node_feature})
perm = np.arange(0, num_nodes)
np.random.shuffle(perm)
train_num = int(num_nodes * self.train_percentage)
self.train_index = perm[:train_num]
self.test_index = perm[train_num:]
edge_distribution = np.array(edges_weight, dtype=np.float32)
self.edge_distribution = edge_distribution / np.sum(edge_distribution)
# pass the normalized weights so AliasSampling receives a valid probability distribution
self.edge_sampling = AliasSampling(prob=self.edge_distribution)
node_dist = np.array(node_degree_list, dtype=np.float32)
node_negative_distribution = np.power(node_dist, 0.75)
self.node_negative_distribution = node_negative_distribution / np.sum(
node_negative_distribution)
self.node_sampling = AliasSampling(prob=self.node_negative_distribution)
self.node_index = {}
self.node_index_reversed = {}
for index, e in enumerate(self.graph.edges):
self.node_index[e[0]] = index
self.node_index_reversed[index] = e[0]
def fetch_batch(self,
batch_size=16,
K=10,
edge_sampling='alias',
node_sampling='alias'):
"""Fetch batch data from dataset.
"""
if edge_sampling == 'numpy':
edge_batch_index = np.random.choice(
self.graph.num_edges,
size=batch_size,
p=self.edge_distribution)
elif edge_sampling == 'alias':
edge_batch_index = self.edge_sampling.sampling(batch_size)
elif edge_sampling == 'uniform':
edge_batch_index = np.random.randint(
0, self.graph.num_edges, size=batch_size)
u_i = []
u_j = []
label = []
for edge_index in edge_batch_index:
edge = self.graph.edges[edge_index]
u_i.append(edge[0])
u_j.append(edge[1])
label.append(1)
for i in range(K):
while True:
if node_sampling == 'numpy':
negative_node = np.random.choice(
self.graph.num_nodes,
p=self.node_negative_distribution)
elif node_sampling == 'alias':
negative_node = self.node_sampling.sampling()
elif node_sampling == 'uniform':
negative_node = np.random.randint(0,
self.graph.num_nodes)
# make sure the sampled node has no edge with the source node
if not self.graph.has_edges_between(
np.array(
[self.node_index_reversed[negative_node]]),
np.array([self.node_index_reversed[edge[0]]])):
break
u_i.append(edge[0])
u_j.append(negative_node)
label.append(-1)
u_i = np.array([u_i], dtype=np.int64).T
u_j = np.array([u_j], dtype=np.int64).T
label = np.array(label, dtype=np.float32)
return u_i, u_j, label
class AliasSampling:
"""Implemention of Alias-Method
This is an implementation of Alias-Method for sampling efficiently from
a discrete probability distribution.
Reference: https://en.wikipedia.org/wiki/Alias_method
Args:
prob: The discrete probability distribution.
"""
def __init__(self, prob):
self.n = len(prob)
self.U = np.array(prob) * self.n
self.K = [i for i in range(len(prob))]
overfull, underfull = [], []
for i, U_i in enumerate(self.U):
if U_i > 1:
overfull.append(i)
elif U_i < 1:
underfull.append(i)
while len(overfull) and len(underfull):
i, j = overfull.pop(), underfull.pop()
self.K[j] = i
self.U[i] = self.U[i] - (1 - self.U[j])
if self.U[i] > 1:
overfull.append(i)
elif self.U[i] < 1:
underfull.append(i)
def sampling(self, n=1):
"""Sampling.
"""
x = np.random.rand(n)
i = np.floor(self.n * x)
y = self.n * x - i
i = i.astype(np.int64)
res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)]
if n == 1:
return res[0]
else:
return res
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file preprocess the FlickrDataset for LINE model.
"""
import argparse
import operator
import os
def process_data(groupsMemberships_file, flickr_links_file, users_label_file,
edges_file, users_file):
"""Preprocess flickr network dataset.
Args:
groupsMemberships_file: flickr-groupmemberships.txt file,
each line is a pair (user, group), which indicates a user belongs to a group.
flickr_links_file: flickr-links.txt file,
each line is a pair (user, user), which indicates
the two users have a relationship.
users_label_file: each line is a pair (user, list of group),
each user may belong to multiple groups.
edges_file: each line is a pair (user, user), which indicates that
the two users have a relationship; edges involving unlabeled users are filtered out.
users_file: each line is an integer, which indicates the ID of a user.
"""
group2users = {}
with open(groupsMemberships_file, 'r') as f:
for line in f:
user, group = line.strip().split()
try:
group2users[int(group)].append(user)
except KeyError:
group2users[int(group)] = [user]
# counting how many users belong to every group
group2usersNum = {}
for key, item in group2users.items():
group2usersNum[key] = len(item)
groups_sorted_by_usersNum = sorted(
group2usersNum.items(), key=operator.itemgetter(1), reverse=True)
# the paper only needs the 5 groups with the largest number of users
label = 1 # remapping the 5 groups from 1 to 5
users_label = {}
for i in range(5):
users_list = group2users[groups_sorted_by_usersNum[i][0]]
for user in users_list:
# one user may have multi-labels
try:
users_label[user].append(label)
except KeyError:
users_label[user] = [label]
label += 1
# remap the user IDs so that node IDs run from 1 to N
userID2nodeID = {}
count = 1
for key in sorted(users_label.keys()):
userID2nodeID[key] = count
count += 1
with open(users_label_file, 'w') as writer:
for key in sorted(users_label.keys()):
line = ' '.join([str(i) for i in users_label[key]])
writer.write(str(userID2nodeID[key]) + ',' + line + '\n')
# produce edges file
with open(flickr_links_file, 'r') as reader, open(edges_file,
'w') as writer:
for line in reader:
src, dst = line.strip().split('\t')
# filter unused user IDs
if src in users_label and dst in users_label:
# remapping the users IDs
src = userID2nodeID[src]
dst = userID2nodeID[dst]
writer.write(str(src) + '\t' + str(dst) + '\n')
# produce nodes file
with open(users_file, 'w') as writer:
for i in range(1, 1 + len(userID2nodeID)):
writer.write(str(i) + '\n')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='LINE')
parser.add_argument(
'--groupmemberships',
type=str,
default='./data/flickr/flickr-groupmemberships.txt',
help='groupmemberships of flickr dataset')
parser.add_argument(
'--flickr_links',
type=str,
default='./data/flickr/flickr-links.txt',
help='the flickr-links.txt file for training')
parser.add_argument(
'--nodes_label',
type=str,
default='./data/flickr/nodes_label.txt',
help='nodes (users) label file for training')
parser.add_argument(
'--edges',
type=str,
default='./data/flickr/edges.txt',
help='the result edges (links) file for training')
parser.add_argument(
'--nodes',
type=str,
default='./data/flickr/nodes.txt',
help='the nodes (users) file for training')
args = parser.parse_args()
process_data(args.groupmemberships, args.flickr_links, args.nodes_label,
args.edges, args.nodes)
......@@ -13,8 +13,9 @@
# limitations under the License.
"""Generate pgl apis
"""
__version__ = "0.1.0.beta"
__version__ = "1.0.0"
from pgl import layers
from pgl import graph_wrapper
from pgl import graph
from pgl import data_loader
from pgl import contrib