Unverified commit 97c241f3, authored by kirayummy, committed by GitHub

Merge pull request #2 from PaddlePaddle/master

merge
......@@ -23,7 +23,7 @@ repos:
sha: 5bf6c09bfa1297d3692cadd621ef95f1284e33c0
hooks:
- id: check-added-large-files
args: [--maxkb=1024]
args: [--maxkb=4096]
- id: check-merge-conflict
- id: check-symlinks
- id: detect-private-key
......
# Paddle Graph Learning (PGL)
<img src="./docs/source/_static/logo.png" alt="The logo of Paddle Graph Learning (PGL)" width="320">
[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/instruction.html)
[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/quick_start/instruction.html) | [中文](./README.zh.md)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
<img src="./docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
We provide Python interfaces for storing/reading/querying graph-structured data and two fundamental computational interfaces, the walk-based paradigm and the message-passing paradigm shown in the PGL framework above, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
The newly released PGL supports heterogeneous graph learning for both the walk-based paradigm and the message-passing paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and some distributed training algorithms, such as distributed DeepWalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
## Highlight: Efficient and Flexible Message Passing Paradigm
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. At PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help build customized graph neural networks easily. Users only need to write ```send``` and ```recv``` functions to easily implement a simple GCN. As shown in the following figure, for the first step the send function is defined on the edges of the graph, and the user can customize the send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) to send the message from the source to the target node. For the second step, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) is responsible for aggregating messages from different sources with the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus).
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies an aggregate function ![](http://latex.codecogs.com/gif.latex?\\oplus) on each batch serially. For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), taking the messages as variable-length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<img src="https://github.com/PaddlePaddle/PGL/blob/master/docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
Users only need to call the ```sequence_ops``` functions provided by Paddle to easily implement efficient message aggregation. For example, use ```sequence_pool``` to sum the neighbor messages.
......@@ -33,14 +33,14 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e
```
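The collapsed hunk above contains the `sequence_pool`-based aggregation that also appears verbatim later in this diff; a minimal sketch for reference:

```python
import paddle.fluid as fluid

def recv(msg):
    # sum the variable-length neighbor messages stored in the LodTensor
    return fluid.layers.sequence_pool(msg, "sum")
```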
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** with the degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPUs. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** with the degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPUs. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
## Performance
### Performance
We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.
We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
......@@ -49,12 +49,64 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message passing in DGL is forced to degenerate into the degree bucketing scheme, and its speed becomes much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL can reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
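The GraphSAGE-LSTM implementation itself is not shown in this diff, so the following is only a hedged sketch of what such a user-defined LSTM aggregation could look like with Paddle's sequence ops; the `lstm_recv` name and the choice of pooling the last hidden state are assumptions.

```python
import paddle.fluid as fluid

def lstm_recv(msg, hidden_size):
    # msg is the LodTensor of variable-length neighbor messages;
    # dynamic_lstm expects its input width to be 4 * hidden_size (one slice per gate)
    gate_input = fluid.layers.fc(input=msg, size=hidden_size * 4, bias_attr=False)
    hidden, _ = fluid.layers.dynamic_lstm(input=gate_input, size=hidden_size * 4)
    # keep the last hidden state of each neighbor sequence as the aggregated message
    return fluid.layers.sequence_pool(hidden, "last")
```

Because the messages stay in one LodTensor, this aggregation runs over all destination nodes at once instead of looping over degree buckets.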
## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
Graphs can conveniently represent the relations between things in the real world, but the categories of things and the relations between them vary widely. Therefore, in a heterogeneous graph, we need to distinguish the node types and edge types in the graph network. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support meta path walk sampling on heterogeneous graph
<img src="./docs/source/_static/metapath_sampling.png" alt="The metapath sampling in heterogeneous graph" width="800">
The left side of the figure above shows a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side of the figure shows a simple MetaPath sampling process: given the meta path UPU (user-product-user), sampling yields results like the following
<img src="./docs/source/_static/metapath_result.png" alt="The metapath result" width="320">
On this basis, word2vec-style methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
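The PGL sampling API itself is not shown in this diff, so the following toy sketch (all names assumed) only illustrates how a UPU-style walk alternates node types along the meta path:

```python
import random

def metapath_walk(start_node, metapath, neighbors, walk_length):
    # Toy meta-path walk: `metapath` is e.g. ['user', 'product', 'user'] and
    # `neighbors[(src_type, dst_type)]` maps a node to its neighbors of dst_type.
    walk = [start_node]
    step = 0
    while len(walk) < walk_length:
        hop = step % (len(metapath) - 1)
        src_type, dst_type = metapath[hop], metapath[hop + 1]
        candidates = neighbors.get((src_type, dst_type), {}).get(walk[-1], [])
        if not candidates:
            break
        walk.append(random.choice(candidates))
        step += 1
    return walk
```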
### Support Message Passing mechanism on heterogeneous graph
<img src="./docs/source/_static/him_message_passing.png" alt="The message passing mechanism on heterogeneous graph" width="800">
Because a heterogeneous graph contains different node types, message passing also differs by type. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, messages from neighbors of different types are aggregated separately and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
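A minimal sketch of this two-stage aggregation (the function name and the merge-by-concatenation choice are assumptions; a complete `HeterGraphWrapper` example appears in the heterogeneous quick start later in this diff):

```python
import paddle.fluid as fluid

def hetero_recv(per_type_msgs):
    # aggregate the messages of each edge type separately ...
    aggregated = [fluid.layers.sequence_pool(msg, "sum") for msg in per_type_msgs]
    # ... then merge them into the final message that updates the target nodes
    return fluid.layers.concat(aggregated, axis=1)
```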
## Large-Scale: Support distributed graph storage and distributed training algorithms
In most large-scale graph learning scenarios, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so a large-scale distributed training algorithm can easily be set up on MPI clusters.
<img src="./docs/source/_static/distributed_frame.png" alt="The distributed frame of PGL" width="800">
## Model Zoo
The following are 13 graph learning models that have been implemented in the framework. See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#highlight-tons-of-models)
|Model | feature |
|---|---|
| GCN | Graph Convolutional Neural Networks |
| GAT | Graph Attention Network |
| GraphSage |Large-scale graph convolution network based on neighborhood sampling|
| unSup-GraphSage | Unsupervised GraphSAGE |
| LINE | Representation learning based on first-order and second-order neighbors |
| DeepWalk | Representation learning by DFS random walk |
| MetaPath2Vec | Representation learning based on metapath |
| Node2Vec | Representation learning combining DFS and BFS |
| Struct2Vec | Representation learning based on structural similarity |
| SGC | Simplified graph convolution neural network |
| GES | Graph representation learning with node features |
| DGI | Unsupervised representation learning based on graph convolution network |
| GATNE | Representation Learning of Heterogeneous Graph based on MessagePassing |
The above models cover three areas: graph representation learning, graph neural networks, and heterogeneous graph learning, where the heterogeneous graph models are themselves divided into graph representation learning and graph neural networks.
## System requirements
PGL requires:
* paddle >= 1.5
* networkx
* paddle >= 1.6
* cython
......@@ -63,15 +115,18 @@ PGL supports both Python 2 & 3
## Installation
pip install pgl
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by NLP and Paddle Teams at Baidu
E-mail: nlp-gnn[at]baidu.com
## License
PGL uses Apache License 2.0.
<img src="./docs/source/_static/logo.png" alt="The logo of Paddle Graph Learning (PGL)" width="320">
[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/quick_start/instruction.html) | [English](./README.md)
Paddle Graph Learning (PGL) is an efficient and easy-to-use graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<img src="./docs/source/_static/framework_of_pgl.png" alt="The Framework of Paddle Graph Learning (PGL)" width="800">
The newly released PGL introduces support for heterogeneous graphs: MetaPath sampling for heterogeneous graph representation learning and a heterogeneous-graph Message Passing mechanism for message-passing-based heterogeneous graph algorithms, so cutting-edge heterogeneous graph learning algorithms can easily be built with the new interfaces. The latest release also adds distributed graph storage and several distributed graph learning training algorithms, such as distributed DeepWalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, our framework can cover most graph network applications, including graph representation learning and graph neural networks.
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
Compared with other models, the biggest advantage of graph neural network models is that they exploit the connectivity information between nodes, but writing code to model these node connections is very cumbersome. PGL adopts a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) as the interface for building graph neural networks. Users only need to write simple ```send``` and ```recv``` functions to easily implement a simple GCN. As shown in the figure below, the send function is first defined on the edges between nodes, and the user-defined send function ![](http://latex.codecogs.com/gif.latex?\\phi^e) sends the message from the source node to the target node. Then, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v) aggregates these messages with the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus).
<img src="./docs/source/_static/message_passing_paradigm.png" alt="The basic idea of message passing paradigm" width="800">
As shown in the left figure below, to adapt to user-defined aggregation functions, DGL uses Degree Bucketing to group nodes with the same degree into a block and then applies the aggregation function ![](http://latex.codecogs.com/gif.latex?\\oplus) on each block. For PGL's user-defined aggregation functions, we instead handle the messages as PaddlePaddle [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html)s, treating a group of messages as variable-length sequences, and **utilize the LodTensor features of PaddlePaddle for fast parallel message aggregation**.
<img src="./docs/source/_static/parallel_degree_bucketing.png" alt="The parallel degree bucketing of PGL" width="800">
Users only need to call PaddlePaddle's sequence-related ```sequence_ops``` functions to implement efficient message aggregation. For example, the snippet below simply uses ```sequence_pool``` to sum the neighbor messages.
```python
import paddle.fluid as fluid
def recv(msg):
return fluid.layers.sequence_pool(msg, "sum")
```
Although DGL uses kernel fusion to optimize common aggregation functions such as sum and max with scatter-gather, for **complex user-defined functions** its Degree Bucketing algorithm only processes the different buckets serially and does not make full use of the GPU for acceleration. In PGL, however, the LodTensor-based message passing can fully exploit GPU parallelism; with complex user-defined functions, PGL can even reach up to 13 times the speed of DGL in our experiments. Even without the scatter-gather optimization, PGL still delivers efficient performance. Of course, we also provide scatter-optimized aggregation functions.
### Performance
We tested all of the following GNN algorithms on a Tesla V100-SXM2-16G, running each for 200 epochs to compute the average speed. The accuracy is computed on the test set, and we did not use an early-stopping strategy.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
| Pubmed | GCN |79.2% |**0.0049s** |0.0051s |
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation function, for example [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf), which aggregates the neighbor features of a node with an LSTM while ignoring the order in which the neighbor messages arrive, the message passing used by DGL degenerates into the Degree Bucketing mode; in this case the model implemented in DGL is much slower than the PGL one. The performance varies with the scale of the graph; in our experiments, PGL can even reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up |
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## Highlight: Flexibility - Natively Support Heterogeneous Graphs
Graphs can conveniently represent the relations between things in the real world, but the categories of things and the relations between them vary widely. Therefore, in a heterogeneous graph, we need to distinguish the node types and edge types in the graph network. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support MetaPath walk sampling on heterogeneous graphs
<img src="./docs/source/_static/metapath_sampling.png" alt="The metapath sampling in heterogeneous graph" width="800">
The left side of the figure above shows a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side of the figure shows a simple MetaPath sampling process: given the input meta path UPU (user-product-user), the sampled results are as follows
<img src="./docs/source/_static/metapath_result.png" alt="The metapath result" width="320">
On this basis, word2vec-style methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
### Support the Message Passing mechanism on heterogeneous graphs
<img src="./docs/source/_static/him_message_passing.png" alt="The message passing mechanism on heterogeneous graph" width="800">
Because a heterogeneous graph contains different node types, message passing also differs by type. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, messages from neighbors of different types are aggregated separately and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
## Highlight: Scalability - Support distributed graph storage and distributed learning algorithms
In large-scale graph learning, we usually need multi-machine graph storage and distributed training. As shown in the following figure, PGL provides a solution for large-scale training: we use [PaddleFleet](https://github.com/PaddlePaddle/Fleet) (which supports large-scale distributed embedding learning) as our parameter server module, together with a simple distributed storage scheme, so a distributed large-scale graph learning method can easily be set up on an MPI cluster.
<img src="./docs/source/_static/distributed_frame.png" alt="The distributed frame of PGL" width="800">
## Richness - Covering most graph learning models in the industry
The following thirteen graph learning models are already implemented in the framework. See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#highlight-tons-of-models)
| Model | Feature |
|---|---|
| GCN | Graph convolutional network |
| GAT | Attention-based graph convolutional network |
| GraphSage | Large-scale graph convolutional network based on neighborhood sampling |
| unSup-GraphSage | Unsupervised GraphSAGE |
| LINE | Representation learning based on first-order and second-order neighbors |
| DeepWalk | Representation learning by DFS random walk |
| MetaPath2Vec | Representation learning based on metapath |
| Node2Vec | Representation learning combining DFS and BFS |
| Struct2Vec | Representation learning based on structural similarity |
| SGC | Simplified graph convolutional network |
| GES | Graph representation learning with node features |
| DGI | Unsupervised representation learning based on graph convolutional networks |
| GATNE | Heterogeneous graph representation learning based on message passing |
The above models cover three areas: graph representation learning, graph neural networks, and heterogeneous graphs, where the heterogeneous graph part is itself divided into graph representation learning and graph neural networks.
## Requirements
PGL depends on:
* paddle >= 1.6
* cython
PGL supports both Python 2 and 3.
## Installation
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by the NLP and Paddle teams at Baidu.
Contact E-mail: nlp-gnn[at]baidu.com
## License
PGL uses Apache License 2.0.
sphinx==2.1.0
mistune
sphinx_rtd_theme
numpy >= 1.16.4
cython >= 0.25.2
paddlepaddle
pgl
pgl.contrib.heter\_graph module: Heterogeneous Graph Storage
================================================================
.. automodule:: pgl.contrib.heter_graph
:members:
:undoc-members:
:show-inheritance:
pgl.contrib.heter\_graph\_wrapper module: Heterogeneous Graph data holders for Paddle GNN.
===========================================================================================
.. automodule:: pgl.contrib.heter_graph_wrapper
:members:
:undoc-members:
:show-inheritance:
......@@ -8,3 +8,6 @@ API Reference
pgl.layers
pgl.data_loader
pgl.utils.paddle_helper
pgl.utils.mp_reader
pgl.contrib.heter_graph
pgl.contrib.heter_graph_wrapper
pgl.utils.mp\_reader module: MultiProcessing reader helper function for Paddle.
=================================================================================
.. automodule:: pgl.utils.mp_reader
:members:
:undoc-members:
:show-inheritance:
......@@ -40,7 +40,7 @@ copyright = '2019, PaddlePaddle'
author = 'PaddlePaddle'
# The full version, including alpha/beta/rc tags
release = '0.1.0.beta'
release = '1.0.1'
# -- General configuration ---------------------------------------------------
......@@ -73,13 +73,12 @@ lanaguage = "zh_cn"
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_show_sourcelink = False
#html_logo = 'pgl_logo.png'
html_logo = '_static/logo.png'
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
'''
html_theme_options = {
'canonical_url': '',
'analytics_id': 'UA-XXXXXXX-1', # Provided by Google in your dashboard
......@@ -96,4 +95,3 @@ html_theme_options = {
'includehidden': True,
'titles_only': False
}
'''
.. mdinclude:: ../../../examples/dgi/README.md
.. mdinclude:: ../../../examples/distribute_deepwalk/README.md
.. mdinclude:: ../../../examples/distribute_graphsage/README.md
.. mdinclude:: md/gat_examples.md
.. mdinclude:: ../../../examples/gat/README.md
View the Code
=============
examples/gat/train.py
.. literalinclude:: ../../../examples/gat/train.py
:language: python
:linenos:
.. mdinclude:: ../../../examples/GATNE/README.md
.. mdinclude:: md/gcn_examples.md
.. mdinclude:: ../../../examples/gcn/README.md
View the Code
=============
examples/gcn/train.py
.. literalinclude:: ../../../examples/gcn/train.py
:language: python
:linenos:
.. mdinclude:: ../../../examples/ges/README.md
.. mdinclude:: md/graphsage_examples.md
.. mdinclude:: ../../../examples/graphsage/README.md
View the Code
=============
examples/graphsage/train.py
.. literalinclude:: ../../../examples/graphsage/train.py
:language: python
:linenos:
examples/graphsage/reader.py
.. literalinclude:: ../../../examples/graphsage/reader.py
:language: python
:linenos:
examples/graphsage/model.py
.. literalinclude:: ../../../examples/graphsage/model.py
:language: python
:linenos:
.. mdinclude:: ../../../examples/line/README.md
# Building Graph Attention Networks
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
### Simple example to build single head GAT
To build a GAT layer, one can use our pre-defined ```pgl.layers.gat``` or just write a GAT layer with the message passing interface.
```python
import paddle.fluid as fluid
def gat_layer(graph_wrapper, node_feature, hidden_size):
def send_func(src_feat, dst_feat, edge_feat):
logits = src_feat["a1"] + dst_feat["a2"]
logits = fluid.layers.leaky_relu(logits, alpha=0.2)
return {"logits": logits, "h": src_feat }
def recv_func(msg):
norm = fluid.layers.sequence_softmax(msg["logits"])
output = msg["h"] * norm
return output
h = fluid.layers.fc(node_feature, hidden_size, bias_attr=False, name="hidden")
a1 = fluid.layers.fc(node_feature, 1, name="a1_weight")
a2 = fluid.layers.fc(node_feature, 1, name="a2_weight")
message = graph_wrapper.send(send_func,
nfeat_list=[("h", h), ("a1", a1), ("a2", a2)])
output = graph_wrapper.recv(recv_func, message)
return output
```
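A hedged usage sketch of the layer above (the `gw` graph wrapper and the `feature` node feature name are assumptions, mirroring the quick-start walkthrough later in this diff):

```python
# assumed: gw is a pgl.graph_wrapper.GraphWrapper holding the graph and a
# node feature named 'feature'
output = gat_layer(gw, gw.node_feat['feature'], hidden_size=8)
```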
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275s | 0.0253s |
### How to run
For example, use GPU to train GAT on the Cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](gat_examples_code.html)
# Building Graph Convolutional Network
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks.
### Simple example to build GCN
To build a GCN layer, one can use our pre-defined ```pgl.layers.gcn``` or just write a GCN layer with the message passing interface.
```python
import paddle.fluid as fluid
def gcn_layer(graph_wrapper, node_feature, hidden_size, act):
def send_func(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def recv_func(msg):
return fluid.layers.sequence_pool(msg, "sum")
message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)])
output = graph_wrapper.recv(recv_func, message)
output = fluid.layers.fc(output, size=hidden_size, act=act)
return output
```
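A hedged usage sketch of the layer above (the `gw` graph wrapper and the `feature` node feature name are assumptions, mirroring the quick-start walkthrough later in this diff):

```python
# assumed: gw is a pgl.graph_wrapper.GraphWrapper holding the graph and a
# node feature named 'feature'
output = gcn_layer(gw, gw.node_feat['feature'], hidden_size=8, act='relu')
output = gcn_layer(gw, output, hidden_size=2, act=None)
```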
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
### How to run
For example, use GPU to train GCN on the Cora dataset.
```
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](gcn_examples_code.html)
# Graph Representation Learning: Node2vec
[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce node2vec algorithms and reach the same level of indicators as the paper.
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
## Dependencies
- paddlepaddle>=1.4
- pgl
## How to run
For example, use GPU to train node2vec on the BlogCatalog dataset.
```sh
# multiclass task example
python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_BlogCatalog/ --offline_learning --epoch 400
python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000
# link prediction task example
python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 10
python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400
```
## Hyperparameters
- dataset: The citation dataset "BlogCatalog" and "ArXiv".
- use_cuda: Use gpu if assign use_cuda.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|deepwalk|multi-label classification|MacroF1|0.250|0.211
BlogCatalog|node2vec|multi-label classification|MacroF1|0.262|0.258
ArXiv|deepwalk|link prediction|AUC|0.9538|0.9340
ArXiv|node2vec|link prediction|AUC|0.9541|0.9366
## View the Code
See the code [here](node2vec_examples_code.html)
# StaticGraphWrapper for GAT Speed Optimization
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and reach the same level of indicators as the paper on citation network benchmarks.
However, different from the reproduction in **examples/gat**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which achieves better speed.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
### How to run
For example, use GPU to train GAT on the Cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](static_gat_examples_code.html)
# StaticGraphWrapper for GCN Speed Optimization
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks.
However, different from the reproduction in **examples/gcn**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into GPU or CPU memory, which achieves better speed.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
### How to run
For example, use GPU to train GCN on the Cora dataset.
```sh
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if use_cuda is set.
### View the Code
See the code [here](static_gcn_examples_code.html)
.. mdinclude:: ../../../examples/metapath2vec/README.md
.. mdinclude:: md/node2vec_examples.md
.. mdinclude:: ../../../examples/node2vec/README.md
View the Code
=============
examples/node2vec/node2vec.py
.. literalinclude:: ../../../examples/node2vec/node2vec.py
:language: python
:linenos:
examples/node2vec/multi_class.py
.. literalinclude:: ../../../examples/node2vec/multi_class.py
:language: python
:linenos:
examples/node2vec/link_predict.py
.. literalinclude:: ../../../examples/node2vec/link_predict.py
:language: python
:linenos:
.. mdinclude:: ../../../examples/sgc/README.md
.. mdinclude:: md/static_gat_examples.md
.. mdinclude:: ../../../examples/static_gat/README.md
View the Code
=============
examples/static_gat/train.py
.. literalinclude:: ../../../examples/static_gat/train.py
:language: python
:linenos:
.. mdinclude:: md/static_gcn_examples.md
.. mdinclude:: ../../../examples/static_gcn/README.md
View the Code
=============
examples/static_gcn/train.py
.. literalinclude:: ../../../examples/static_gcn/train.py
:language: python
:linenos:
.. mdinclude:: ../../../examples/strucvec/README.md
.. mdinclude:: ../../../examples/unsup_graphsage/README.md
......@@ -15,14 +15,9 @@ Quick Start
.. toctree::
:maxdepth: 1
:caption: Quick Start
:hidden:
instruction.rst
See instruction_ for quick start.
.. _instruction: instruction.html
quick_start/instruction.rst
quick_start/introduction_for_hetergraph.rst
.. toctree::
......@@ -34,7 +29,16 @@ See instruction_ for quick start.
examples/static_graph_wrapper.rst
examples/node2vec_examples.rst
examples/graphsage_examples.rst
examples/dgi_examples.rst
examples/distribute_deepwalk_examples.rst
examples/distribute_graphsage_examples.rst
examples/ges_examples.rst
examples/line_examples.rst
examples/sgc_examples.rst
examples/strucvec_examples.rst
examples/gatne_examples.rst
examples/metapath2vec_examples.rst
examples/unsup_graphsage_examples.rst
.. toctree::
:maxdepth: 2
......
# Paddle Graph Learning (PGL)
Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
<div />
......@@ -9,17 +8,18 @@ Paddle Graph Learning (PGL) is an efficient and flexible graph learning framewor
<center>The Framework of Paddle Graph Learning (PGL)</center>
<div />
We provide Python interfaces for storing/reading/querying graph-structured data and two fundamental computational interfaces, the walk-based paradigm and the message-passing paradigm shown in the PGL framework above, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
The newly released PGL supports heterogeneous graph learning for both the walk-based paradigm and the message-passing paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and some distributed training algorithms, such as distributed DeepWalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing
## Highlight: Efficient and Flexible <br/> Message Passing Paradigm
One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. At PGL we adopt a **Message Passing Paradigm** similar to DGL to help build customized graph neural networks easily. Users only need to write ``send`` and ``recv`` functions to easily implement a simple GCN. As shown in the following figure, for the first step the send function is defined on the edges of the graph, and the user can customize the send function $\phi^e$ to send the message from the source to the target node. For the second step, the recv function $\phi^v$ is responsible for aggregating messages from different sources with the aggregation function $\oplus$.
<div />
<div align=center><img src="_static/message_passing_paradigm.png" width="700"></div>
<center>The basic idea of message passing paradigm</center>
<div />
As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then applies an aggregate function $\oplus$ on each batch serially. For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), taking the messages as variable-length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
<div/>
......@@ -28,21 +28,22 @@ As shown in the left of the following figure, to adapt general user-defined mess
<div/>
Users only need to call the ```sequence_ops``` functions provided by Paddle to easily implement efficient message aggregation. For example, use ```sequence_pool``` to sum the neighbor messages.
Users only need to call the ``sequence_ops`` functions provided by Paddle to easily implement efficient message aggregation. For example, use ``sequence_pool`` to sum the neighbor messages.
```python
import paddle.fluid as fluid
def recv(msg):
return fluid.layers.sequence_pool(msg, "sum")
```
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** with the degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPUs. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** with the degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPUs. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.
## Performance
### Performance
We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.
| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
......@@ -50,3 +51,95 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
If we use a complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message passing in DGL is forced to degenerate into the degree bucketing scheme, and its speed becomes much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL can reach up to 13 times the speed of DGL.
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
| -------- | ------------ | ------------------------------------ |----|
| Cora | **0.0186s** | 0.1638s | 8.80x|
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |
## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
Graphs can conveniently represent the relations between things in the real world, but the categories of things and the relations between them vary widely. Therefore, in a heterogeneous graph, we need to distinguish the node types and edge types in the graph network. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
### Support meta path walk sampling on heterogeneous graph
<div/>
<div align=center><img src="_static/metapath_sampling.png" width="750"></div>
<center>The metapath sampling in heterogeneous graph</center>
<div/>
The left side of the figure above shows a shopping social network. Its nodes fall into two categories, users and goods, and its relations include user-user, user-goods, and goods-goods. The right side of the figure shows a simple MetaPath sampling process: given the meta path UPU (user-product-user), sampling yields results like the following
<div/>
<div align=center><img src="_static/metapath_result.png" width="300"></div>
<center>The metapath result</center>
<div/>
On this basis, word2vec-style methods can be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
### Support Message Passing mechanism on heterogeneous graph
<div/>
<div align=center><img src="_static/him_message_passing.png" width="750"></div>
<center>The message passing mechanism on heterogeneous graph</center>
<div/>
Because a heterogeneous graph contains different node types, message passing also differs by type. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, messages from neighbors of different types are aggregated separately and then merged into the final message that updates the target node. On this basis, PGL supports message-passing-based heterogeneous graph algorithms such as GATNE.
## Large-Scale: Support distributed graph storage and distributed training algorithms
In most large-scale graph learning scenarios, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so a large-scale distributed training algorithm can easily be set up on MPI clusters.
<div/>
<div align=center><img src="_static/distributed_frame.png" width="750"></div>
<center>The distributed frame of PGL</center>
<div/>
## Model Zoo
The following are 13 graph learning models that have been implemented in the framework.
|Model | feature |
|---|---|
| [GCN](examples/gcn_examples.html)| Graph Convolutional Neural Networks |
| [GAT](examples/gat_examples.html)| Graph Attention Network |
| [GraphSage](examples/graphsage_examples.html)|Large-scale graph convolution network based on neighborhood sampling|
| [unSup-GraphSage](examples/unsup_graphsage_examples.html) | Unsupervised GraphSAGE |
| [LINE](examples/line_examples.html)| Representation learning based on first-order and second-order neighbors |
| [DeepWalk](examples/distribute_deepwalk_examples.html)| Representation learning by DFS random walk |
| [MetaPath2Vec](examples/metapath2vec_examples.html)| Representation learning based on metapath |
| [Node2Vec](examples/node2vec_examples.html)| Representation learning combining DFS and BFS |
| [Struct2Vec](examples/strucvec_examples.html)| Representation learning based on structural similarity |
| [SGC](examples/sgc_examples.html)| Simplified graph convolution neural network |
| [GES](examples/ges_examples.html)| Graph representation learning with node features |
| [DGI](examples/dgi_examples.html)| Unsupervised representation learning based on graph convolution network |
| [GATNE](examples/gatne_examples.html)| Representation Learning of Heterogeneous Graph based on MessagePassing |
The above models cover three areas: graph representation learning, graph neural networks, and heterogeneous graph learning, where the heterogeneous graph models are themselves divided into graph representation learning and graph neural networks.
## System requirements
PGL requires:
* paddle >= 1.6
* cython
PGL supports both Python 2 & 3
## Installation
You can simply install it via pip.
```sh
pip install pgl
```
## The Team
PGL is developed and maintained by NLP and Paddle Teams at Baidu
## License
PGL uses Apache License 2.0.
......@@ -8,8 +8,7 @@ To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
paddlepaddle >= 1.4 (Faster performance on 1.5)
networkx
paddlepaddle >= 1.6
cython
We can simply install pgl by pip.
......
Quick Start with Heterogeneous Graph
========================================
Install PGL
-----------
To install Paddle Graph Learning, we need the following packages.
.. code-block:: sh
paddlepaddle >= 1.6
cython
We can simply install pgl by pip.
.. code-block:: sh
pip install pgl
.. mdinclude:: md/quick_start_for_heterGraph.md
## Step 1: using PGL to create a graph
Suppose we have a graph with 10 nodes and 14 edges as shown in the following figure:
![A simple graph](_static/quick_start_graph.png)
![A simple graph](images/quick_start_graph.png)
Our purpose is to train a graph neural network to classify yellow and green nodes. So we can create this graph as follows:
```python
......@@ -49,7 +49,7 @@ Currently our PGL is developed based on static computational mode of paddle (we
import paddle.fluid as fluid
use_cuda = False
place = fluid.GPUPlace(0) if use_cuda else fluid.CPUPlace()
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()
# use GraphWrapper as a container for graph data to construct a graph neural network
gw = pgl.graph_wrapper.GraphWrapper(name='graph',
......@@ -95,7 +95,7 @@ After defining the GCN layer, we can construct a deeper GCN model with two GCN l
```python
output = gcn_layer(gw, gw.node_feat['feature'],
hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, hidden_size=2,
output = gcn_layer(gw, output, hidden_size=1,
name='gcn_layer_2', activation=None)
```
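The rest of the quick-start training program is collapsed by the diff; below is a hedged sketch of what typically follows, modeled on the heterogeneous quick start later in this diff (the binary `node_label` data layer and the Adam settings are assumptions):

```python
import paddle.fluid as fluid

# assumed: `output` is the logits tensor produced by the two GCN layers above
node_label = fluid.layers.data("node_label", shape=[None, 1],
                               dtype="float32", append_batch_size=False)
loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=output, label=node_label)
loss = fluid.layers.mean(loss)
fluid.optimizer.Adam(learning_rate=0.01).minimize(loss)
```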
......
## Introduction
In the real world, many graphs contain multiple types of nodes and edges, which we call heterogeneous graphs. Obviously, heterogeneous graphs are more complex than homogeneous graphs.
To deal with such heterogeneous graphs, PGL develops a graph framework to support graph neural network computations and meta-path-based sampling on heterogeneous graphs.
The goals of this tutorial:
* give an example of heterogeneous graph data;
* understand how PGL supports computations on heterogeneous graphs;
* use PGL to implement a simple heterogeneous graph neural network model to classify a particular type of node in a heterogeneous graph network.
## Example of heterogeneous graph
A lot of graph data consists of edges and nodes of multiple types. For example, an **e-commerce network** is a very common heterogeneous graph in the real world. It contains at least two types of nodes (user and item) and two types of edges (buy and click).
The following figure depicts several users clicking or buying some items. This graph has two types of nodes corresponding to "user" and "item". It also contains two types of edges, "buy" and "click".
![A simple heterogenous e-commerce graph](images/heter_graph_introduction.png)
## Creating a heterogeneous graph with PGL
In a heterogeneous graph, there exist multiple types of edges, so we should distinguish them. In PGL, edges are built in the format below:
```python
edges = {
'click': [(0, 4), (0, 7), (1, 6), (2, 5), (3, 6)],
'buy': [(0, 5), (1, 4), (1, 6), (2, 7), (3, 5)],
}
```
In a heterogeneous graph, nodes are also of different types. Therefore, you need to mark the type of each node; the format of the node types is as follows:
```python
node_types = [(0, 'user'), (1, 'user'), (2, 'user'), (3, 'user'), (4, 'item'),
(5, 'item'),(6, 'item'), (7, 'item')]
```
Because edges are of different types, edge features also need to be separated by type.
```python
import numpy as np
num_nodes = len(node_types)
node_features = {'features': np.random.randn(num_nodes, 8).astype("float32")}
edge_num_list = []
for edge_type in edges:
edge_num_list.append(len(edges[edge_type]))
edge_features = {
'click': {'h': np.random.randn(edge_num_list[0], 4)},
'buy': {'h':np.random.randn(edge_num_list[1], 4)},
}
```
Now, we can build a heterogeneous graph with PGL.
```python
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.contrib import heter_graph
from pgl.contrib import heter_graph_wrapper
g = heter_graph.HeterGraph(num_nodes=num_nodes,
edges=edges,
node_types=node_types,
node_feat=node_features,
edge_feat=edge_features)
```
In PGL, we need to use a graph wrapper as a container for graph data, so here we create a graph wrapper for each edge-type subgraph.
```python
place = fluid.CPUPlace()
# create a GraphWrapper as a container for graph data
gw = heter_graph_wrapper.HeterGraphWrapper(name='heter_graph',
place = place,
edge_types = g.edge_types_info(),
node_feat=g.node_feat_info(),
edge_feat=g.edge_feat_info())
```
## Message Passing
After building the heterogeneous graph, we can easily carry out message passing. In this case, we have two different types of edges, so we can write the message passing function as follows:
```python
def message_passing(gw, edge_types, features, name=''):
def __message(src_feat, dst_feat, edge_feat):
return src_feat['h']
def __reduce(feat):
return fluid.layers.sequence_pool(feat, pool_type='sum')
assert len(edge_types) == len(features)
output = []
for i in range(len(edge_types)):
msg = gw[edge_types[i]].send(__message, nfeat_list=[('h', features[i])])
out = gw[edge_types[i]].recv(msg, __reduce)
output.append(out)
# list of matrix
return output
```
```python
edge_types = ['click', 'buy']
features = []
for edge_type in edge_types:
features.append(gw[edge_type].node_feat['features'])
output = message_passing(gw, edge_types, features)
output = fl.concat(input=output, axis=1)
output = fluid.layers.fc(output, size=4, bias_attr=False, act='relu', name='fc1')
logits = fluid.layers.fc(output, size=1, bias_attr=False, act=None, name='fc2')
```
## Data preprocessing
In this case, we implement a simple node classifier, using 0 and 1 to represent the two classes.
```python
y = [0,1,0,1,0,1,1,0]
label = np.array(y, dtype="float32").reshape(-1,1)
```
## Setting up the training program
The training process of the heterogeneous graph node classification model is the same as training other PaddlePaddle-based models.
* First, build the loss function;
* Second, create an optimizer;
* Finally, create an executor and execute the training program.
```python
node_label = fluid.layers.data("node_label", shape=[None, 1], dtype="float32", append_batch_size=False)
loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=logits, label=node_label)
loss = fluid.layers.mean(loss)
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feed_dict = gw.to_feed(g)
for epoch in range(30):
feed_dict['node_label'] = label
train_loss = exe.run(fluid.default_main_program(), feed=feed_dict, fetch_list=[loss], return_numpy=True)
print('Epoch %d | Loss: %f'%(epoch, train_loss[0]))
```
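A hedged sketch of checking the fit after the loop above (assumption: simply fetch `logits` with one more pass of the same program and compare the sign of the predictions against the 0/1 labels defined earlier):

```python
import numpy as np

# note: this runs the training program once more just to fetch the logits
pred = exe.run(fluid.default_main_program(),
               feed=feed_dict, fetch_list=[logits], return_numpy=True)[0]
print('Train accuracy: %f' % np.mean((pred > 0) == (label > 0.5)))
```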
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file loads and preprocesses the dataset for GATNE model.
"""
import sys
import os
import tqdm
import numpy as np
import logging
import random
from pgl import heter_graph
import pickle as pkl
class Dataset(object):
"""Implementation of Dataset class
This is a simple implementation of loading and processing dataset for GATNE model.
Args:
config: dict, some configure parameters.
"""
def __init__(self, config):
self.train_edges_file = config['data_path'] + 'train.txt'
self.valid_edges_file = config['data_path'] + 'valid.txt'
self.test_edges_file = config['data_path'] + 'test.txt'
self.nodes_file = config['data_path'] + 'nodes.txt'
self.config = config
self.word2index = self.load_word2index()
self.build_graph()
self.valid_data = self.load_test_data(self.valid_edges_file)
self.test_data = self.load_test_data(self.test_edges_file)
def build_graph(self):
"""Build pgl heterogeneous graph.
"""
edge_data_by_type, all_edges, all_nodes = self.load_training_data(
self.train_edges_file,
slf_loop=self.config['slf_loop'],
symmetry_edge=self.config['symmetry_edge'])
num_nodes = len(all_nodes)
node_features = {
'index': np.array(
[i for i in range(num_nodes)], dtype=np.int64).reshape(-1, 1)
}
self.graph = heter_graph.HeterGraph(
num_nodes=num_nodes,
edges=edge_data_by_type,
node_types=None,
node_feat=node_features)
self.edge_types = sorted(self.graph.edge_types_info())
logging.info('total %d nodes are loaded' % (self.graph.num_nodes))
def load_training_data(self, file_, slf_loop=True, symmetry_edge=True):
"""Load train data from file and preprocess them.
Args:
file_: str, file name for loading data
slf_loop: bool, if true, add self loop edge for every node
symmetry_edge: bool, if true, add symmetry edge for every edge
"""
logging.info('loading data from %s' % file_)
edge_data_by_type = dict()
all_edges = list()
all_nodes = list()
with open(file_, 'r') as reader:
for line in reader:
words = line.strip().split(' ')
if words[0] not in edge_data_by_type:
edge_data_by_type[words[0]] = []
src, dst = words[1], words[2]
edge_data_by_type[words[0]].append((src, dst))
all_edges.append((src, dst))
all_nodes.append(src)
all_nodes.append(dst)
if symmetry_edge:
edge_data_by_type[words[0]].append((dst, src))
all_edges.append((dst, src))
all_nodes = list(set(all_nodes))
all_edges = list(set(all_edges))
# edge_data_by_type['Base'] = all_edges
if slf_loop:
for e_type in edge_data_by_type.keys():
for n in all_nodes:
edge_data_by_type[e_type].append((n, n))
# remapping to index
edges_by_type = {}
for edge_type, edges in edge_data_by_type.items():
res_edges = []
for edge in edges:
res_edges.append(
(self.word2index[edge[0]], self.word2index[edge[1]]))
edges_by_type[edge_type] = res_edges
return edges_by_type, all_edges, all_nodes
def load_test_data(self, file_):
"""Load testing data from file.
"""
logging.info('loading data from %s' % file_)
true_edge_data_by_type = {}
fake_edge_data_by_type = {}
with open(file_, 'r') as reader:
for line in reader:
words = line.strip().split(' ')
src, dst = self.word2index[words[1]], self.word2index[words[2]]
e_type = words[0]
if int(words[3]) == 1: # true edges
if e_type not in true_edge_data_by_type:
true_edge_data_by_type[e_type] = list()
true_edge_data_by_type[e_type].append((src, dst))
else: # fake edges
if e_type not in fake_edge_data_by_type:
fake_edge_data_by_type[e_type] = list()
fake_edge_data_by_type[e_type].append((src, dst))
return (true_edge_data_by_type, fake_edge_data_by_type)
def load_word2index(self):
"""Load words(nodes) from file and map to index.
"""
word2index = {}
with open(self.nodes_file, 'r') as reader:
for index, line in enumerate(reader):
node = line.strip()
word2index[node] = index
return word2index
def generate_walks(self):
"""Generate random walks for every edge type.
"""
all_walks = {}
for e_type in self.edge_types:
layer_walks = self.simulate_walks(
edge_type=e_type,
num_walks=self.config['num_walks'],
walk_length=self.config['walk_length'])
all_walks[e_type] = layer_walks
return all_walks
def simulate_walks(self, edge_type, num_walks, walk_length, schema=None):
"""Generate random walks in specified edge type.
"""
walks = []
nodes = list(range(0, self.graph[edge_type].num_nodes))
for walk_iter in tqdm.tqdm(range(num_walks)):
random.shuffle(nodes)
for node in nodes:
walk = self.graph[edge_type].random_walk(
[node], max_depth=walk_length - 1)
for i in range(len(walk)):
walks.append(walk[i])
return walks
def generate_pairs(self, all_walks):
"""Generate word pairs for training.
"""
logging.info(['edge_types before generate pairs', self.edge_types])
pairs = []
skip_window = self.config['win_size'] // 2
for layer_id, e_type in enumerate(self.edge_types):
walks = all_walks[e_type]
for walk in tqdm.tqdm(walks):
for i in range(len(walk)):
for j in range(1, skip_window + 1):
if i - j >= 0 and walk[i] != walk[i - j]:
neg_nodes = self.graph[e_type].sample_nodes(
self.config['neg_num'])
pairs.append(
(walk[i], walk[i - j], *neg_nodes, layer_id))
if i + j < len(walk) and walk[i] != walk[i + j]:
neg_nodes = self.graph[e_type].sample_nodes(
self.config['neg_num'])
pairs.append(
(walk[i], walk[i + j], *neg_nodes, layer_id))
return pairs
def fetch_batch(self, pairs, batch_size, for_test=False):
"""Produce batch pairs data for training.
"""
np.random.shuffle(pairs)
n_batches = (len(pairs) + (batch_size - 1)) // batch_size
neg_num = len(pairs[0]) - 3
result = []
        for i in range(1, n_batches + 1):
batch_pairs = np.array(
pairs[batch_size * (i - 1):batch_size * i], dtype=np.int64)
x = batch_pairs[:, 0].reshape(-1, ).astype(np.int64)
y = batch_pairs[:, 1].reshape(-1, 1, 1).astype(np.int64)
neg = batch_pairs[:, 2:2 + neg_num].reshape(-1, neg_num,
1).astype(np.int64)
t = batch_pairs[:, -1].reshape(-1, 1).astype(np.int64)
result.append((x, y, neg, t))
return result
if __name__ == "__main__":
config = {
'data_path': './data/youtube/',
'train_pairs_file': 'train_pairs.pkl',
'slf_loop': True,
'symmetry_edge': True,
'num_walks': 20,
'walk_length': 10,
'win_size': 5,
'neg_num': 5,
}
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level='INFO', format=log_format)
dataset = Dataset(config)
logging.info('generating walks')
all_walks = dataset.generate_walks()
logging.info('finishing generate walks')
logging.info(['length of all walks: ', all_walks.keys()])
train_pairs = dataset.generate_pairs(all_walks)
pkl.dump(train_pairs,
open(config['data_path'] + config['train_pairs_file'], 'wb'))
logging.info('finishing generate train_pairs')
# GATNE: General Attributed Multiplex HeTerogeneous Network Embedding
[GATNE](https://arxiv.org/pdf/1905.01669.pdf) is an algorithmic framework for embedding large-scale Attributed Multiplex Heterogeneous Networks (AMHN). Given a heterogeneous graph, which consists of nodes and edges of multiple types, it can learn continuous feature representations for every node. Based on PGL, we reproduce the GATNE algorithm.
## Datasets
The YouTube dataset contains 2,000 nodes, 1,310,617 edges and 5 edge types. We use the YouTube dataset as an example.
You can download the YouTube dataset from [here](https://github.com/THUDM/GATNE/tree/master/data).
After downloading the data, put it in ./data/ (the current directory is assumed to be the root directory of the GATNE model). The ./data/youtube/ directory then contains three files:
* train.txt
* valid.txt
* test.txt
Then run the following command to preprocess the data.
```sh
python data_process.py --input_file ./data/youtube/train.txt --output_file ./data/youtube/nodes.txt
```
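The expected file format, as parsed by `data_process.py` and the `Dataset` class, is one space-separated edge per line: `train.txt` lines are `edge_type src dst`, while `valid.txt` and `test.txt` carry an extra 0/1 label marking fake/true edges; the generated `nodes.txt` lists one node id per line. An illustration with made-up ids:
```
# train.txt: <edge_type> <src> <dst>
1 2205 11784
2 2205 723
# valid.txt / test.txt: <edge_type> <src> <dst> <label>  (1 = true edge, 0 = fake edge)
1 2205 11784 1
1 2205 3873 0
```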
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All the hyperparameters are saved in the config.yaml file, so before training the GATNE model you can open config.yaml and modify the hyperparameters as you like.
For example, you can set "use_cuda" to "True" to train on GPU, or change "data_path" to use a different dataset.
Some important hyperparameters in config.yaml:
- use_cuda: use GPU to train model
- data_path: the directory of dataset
- lr: learning rate
- neg_num: number of negative samples.
- num_walks: number of walks started from each node
- walk_length: walk length
## How to run
After preprocessing the data, run the following command to train the model:
```sh
python main.py -c config.yaml
```
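After training, `save_model` pickles a Python dict that maps each edge type to a `(num_nodes, dimensions)` matrix of L2-normalized node embeddings. Below is a minimal sketch for inspecting such a checkpoint; the file name is an assumption and depends on `save_dir`, `task_name` and the epoch.
```python
import pickle as pkl

# Hypothetical checkpoint path; see `save_model` in main.py for the real name.
ckpt = "./checkpoints/train.gatne/dict_embed_model_epoch_2.pkl"

with open(ckpt, "rb") as f:
    final_model = pkl.load(f)

# One (num_nodes, dimensions) embedding matrix per edge type.
for edge_type, embeddings in final_model.items():
    print(edge_type, embeddings.shape)
```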
### Experiment results
| Metric | PGL result | Reported result |
|:---:|------------|-----------------|
| AUC | 84.83 | 84.61 |
| PR | 82.77 | 81.93 |
| F1 | 76.98 | 76.83 |
task_name: train.gatne
use_cuda: True
log_level: info
seed: 1667
optimizer:
type:
args:
lr: 0.005
trainer:
type: trainer
args:
epochs: 2
log_dir: logs/
save_dir: checkpoints/
output_dir: outputs/
data_loader:
type: Dataset
args:
data_path: ./data/youtube/
train_pairs_file: train_pairs.pkl
batch_size: 256
num_walks: 20
walk_length: 10
win_size: 5
neg_num: 5
slf_loop: True
symmetry_edge: True
model:
type: GATNE
args:
dimensions: 200
edge_dim: 32
att_dim: 32
att_head: 1
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file preprocesses the data before training.
"""
import sys
import argparse
def gen_nodes_file(file_, result_file):
"""calculate the total number of nodes and save them for latter processing.
"""
nodes = []
with open(file_, 'r') as reader:
for line in reader:
tokens = line.strip().split(' ')
nodes.append(tokens[1])
nodes.append(tokens[2])
nodes = list(set(nodes))
nodes.sort(key=int)
print('total number of nodes: %d' % len(nodes))
print('saving nodes file in %s' % (result_file))
with open(result_file, 'w') as writer:
for n in nodes:
writer.write(n + '\n')
print('finished')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='GATNE')
parser.add_argument(
'--input_file',
default='./data/youtube/train.txt',
type=str,
help='input file')
parser.add_argument(
'--output_file',
default='./data/youtube/nodes.txt',
type=str,
help='output file')
args = parser.parse_args()
print('generating nodes file')
gen_nodes_file(args.input_file, args.output_file)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the GATNE model.
"""
import os
import argparse
import time
import random
import numpy as np
import logging
import pickle as pkl
import pgl
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from utils import *
import Dataset
import model as Model
from sklearn.metrics import (auc, f1_score, precision_recall_curve,
roc_auc_score)
def set_seed(seed):
"""Set random seed.
"""
random.seed(seed)
np.random.seed(seed)
def produce_model(exe, program, dataset, model, feed_dict):
"""Output the learned model parameters for testing.
"""
edge_types = dataset.edge_types
num_nodes = dataset.graph[edge_types[0]].num_nodes
edge_types_count = len(edge_types)
neg_num = dataset.config['neg_num']
final_model = {}
feed_dict['train_inputs'] = np.array(
[n for n in range(num_nodes)], dtype=np.int64).reshape(-1, )
feed_dict['train_labels'] = np.array(
[n for n in range(num_nodes)], dtype=np.int64).reshape(-1, 1, 1)
feed_dict['train_negs'] = np.tile(feed_dict['train_labels'],
(1, neg_num)).reshape(-1, neg_num, 1)
for i in range(edge_types_count):
feed_dict['train_types'] = np.array(
[i for _ in range(num_nodes)], dtype=np.int64).reshape(-1, 1)
edge_node_embed = exe.run(program,
feed=feed_dict,
fetch_list=[model.last_node_embed],
return_numpy=True)[0]
final_model[edge_types[i]] = edge_node_embed
return final_model
def evaluate(final_model, edge_types, data):
"""Calculate the AUC score, F1 score and PR score of the final model
"""
edge_types_count = len(edge_types)
AUC, F1, PR = [], [], []
true_edge_data_by_type = data[0]
fake_edge_data_by_type = data[1]
for i in range(edge_types_count):
try:
local_model = final_model[edge_types[i]]
true_edges = true_edge_data_by_type[edge_types[i]]
fake_edges = fake_edge_data_by_type[edge_types[i]]
except Exception as e:
logging.warn('edge type not exists. %s' % str(e))
continue
tmp_auc, tmp_f1, tmp_pr = calculate_score(local_model, true_edges,
fake_edges)
AUC.append(tmp_auc)
F1.append(tmp_f1)
PR.append(tmp_pr)
return {'AUC': np.mean(AUC), 'F1': np.mean(F1), 'PR': np.mean(PR)}
def calculate_score(model, true_edges, fake_edges):
"""Calculate the AUC score, F1 score and PR score of specified edge type
"""
true_list = list()
prediction_list = list()
true_num = 0
for edge in true_edges:
tmp_score = get_score(model, edge)
if tmp_score is not None:
true_list.append(1)
prediction_list.append(tmp_score)
true_num += 1
for edge in fake_edges:
tmp_score = get_score(model, edge)
if tmp_score is not None:
true_list.append(0)
prediction_list.append(tmp_score)
sorted_pred = prediction_list[:]
sorted_pred.sort()
threshold = sorted_pred[-true_num]
y_pred = np.zeros(len(prediction_list), dtype=np.int32)
for i in range(len(prediction_list)):
if prediction_list[i] >= threshold:
y_pred[i] = 1
y_true = np.array(true_list)
y_scores = np.array(prediction_list)
ps, rs, _ = precision_recall_curve(y_true, y_scores)
return roc_auc_score(y_true, y_scores), f1_score(y_true, y_pred), auc(rs,
ps)
def get_score(local_model, edge):
"""Calculate the cosine similarity score between two nodes.
"""
try:
vector1 = local_model[edge[0]]
vector2 = local_model[edge[1]]
return np.dot(vector1, vector2) / (np.linalg.norm(vector1) *
np.linalg.norm(vector2))
except Exception as e:
logging.warn('get_score warning: %s' % str(e))
return None
def run_epoch(epoch,
config,
dataset,
data,
train_prog,
test_prog,
model,
feed_dict,
exe,
for_test=False):
"""Run training process of every epoch.
"""
total_loss = []
for idx, batch_data in enumerate(data):
feed_dict['train_inputs'] = batch_data[0]
feed_dict['train_labels'] = batch_data[1]
feed_dict['train_negs'] = batch_data[2]
feed_dict['train_types'] = batch_data[3]
loss, lr = exe.run(train_prog,
feed=feed_dict,
fetch_list=[model.loss, model.lr],
return_numpy=True)
total_loss.append(loss[0])
if (idx + 1) % 500 == 0:
avg_loss = np.mean(total_loss)
logging.info("epoch %d | step %d | lr %.4f | train_loss %f " %
(epoch, idx + 1, lr, avg_loss))
total_loss = []
return avg_loss
def save_model(program, exe, dataset, model, feed_dict, filename):
"""Save model.
"""
final_model = produce_model(exe, program, dataset, model, feed_dict)
logging.info('saving model in %s' % (filename))
pkl.dump(final_model, open(filename, 'wb'))
def test(program, exe, dataset, model, feed_dict):
"""Testing and validating.
"""
final_model = produce_model(exe, program, dataset, model, feed_dict)
valid_result = evaluate(final_model, dataset.edge_types,
dataset.valid_data)
test_result = evaluate(final_model, dataset.edge_types, dataset.test_data)
logging.info("valid_AUC %.4f | valid_PR %.4f | valid_F1 %.4f" %
(valid_result['AUC'], valid_result['PR'], valid_result['F1']))
logging.info("test_AUC %.4f | test_PR %.4f | test_F1 %.4f" %
(test_result['AUC'], test_result['PR'], test_result['F1']))
return test_result
def main(config):
"""main function for training GATNE model.
"""
logging.info(config)
set_seed(config['seed'])
dataset = getattr(
Dataset, config['data_loader']['type'])(config['data_loader']['args'])
edge_types = dataset.graph.edge_types_info()
logging.info(['total edge types: ', edge_types])
# train_pairs is a list of tuple: [(src1, dst1, neg, e1), (src2, dst2, neg, e2)]
    # e (int) is the edge type id, used to select the corresponding edge embedding
train_pairs_file = config['data_loader']['args']['data_path'] + \
config['data_loader']['args']['train_pairs_file']
if os.path.exists(train_pairs_file):
logging.info('loading train pairs from pkl file %s' % train_pairs_file)
train_pairs = pkl.load(open(train_pairs_file, 'rb'))
else:
logging.info('generating walks')
all_walks = dataset.generate_walks()
logging.info('generating train pairs')
train_pairs = dataset.generate_pairs(all_walks)
logging.info('dumping train pairs to %s' % (train_pairs_file))
pkl.dump(train_pairs, open(train_pairs_file, 'wb'))
logging.info('total train pairs: %d' % (len(train_pairs)))
data = dataset.fetch_batch(train_pairs,
config['data_loader']['args']['batch_size'])
place = fluid.CUDAPlace(0) if config['use_cuda'] else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
model = getattr(Model, config['model']['type'])(
config['model']['args'], dataset, place)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
global_steps = len(data) * config['trainer']['args']['epochs']
model.backward(global_steps, config['optimizer']['args'])
# train
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = model.gw.to_feed(dataset.graph)
logging.info('test before training...')
test(test_program, exe, dataset, model, feed_dict)
logging.info('training...')
for epoch in range(1, 1 + config['trainer']['args']['epochs']):
train_result = run_epoch(epoch, config['trainer']['args'], dataset,
data, train_program, test_program, model,
feed_dict, exe)
logging.info('validating and testing...')
test_result = test(test_program, exe, dataset, model, feed_dict)
filename = os.path.join(config['trainer']['args']['save_dir'],
'dict_embed_model_epoch_%d.pkl' % (epoch))
save_model(test_program, exe, dataset, model, feed_dict, filename)
logging.info(
"final_test_AUC %.4f | final_test_PR %.4f | fianl_test_F1 %.4f" % (
test_result['AUC'], test_result['PR'], test_result['F1']))
logging.info('training finished')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='GATNE')
parser.add_argument(
'-c',
'--config',
default=None,
type=str,
help='config file path (default: None)')
parser.add_argument(
'-n',
'--taskname',
default=None,
type=str,
help='task name(default: None)')
args = parser.parse_args()
if args.config:
# load config file
config = Config(args.config, isCreate=True, isSave=True)
config = config()
else:
raise AssertionError(
"Configuration file need to be specified. Add '-c config.yaml', for example."
)
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(
level=getattr(logging, config['log_level'].upper()), format=log_format)
main(config)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the GATNE model.
"""
import numpy as np
import math
import logging
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from pgl import heter_graph_wrapper
class GATNE(object):
"""Implementation of GATNE model.
Args:
config: dict, some configure parameters.
dataset: instance of Dataset class
place: GPU or CPU place
"""
def __init__(self, config, dataset, place):
logging.info(['model is: ', self.__class__.__name__])
self.config = config
self.graph = dataset.graph
        self.place = place
self.edge_types = sorted(self.graph.edge_types_info())
logging.info('edge_types in model: %s' % str(self.edge_types))
neg_num = dataset.config['neg_num']
# hyper parameters
self.num_nodes = self.graph.num_nodes
self.embedding_size = self.config['dimensions']
self.embedding_u_size = self.config['edge_dim']
self.dim_a = self.config['att_dim']
self.att_head = self.config['att_head']
self.edge_type_count = len(self.edge_types)
self.u_num = self.edge_type_count
self.gw = heter_graph_wrapper.HeterGraphWrapper(
name="heter_graph",
place=place,
edge_types=self.graph.edge_types_info(),
node_feat=self.graph.node_feat_info(),
edge_feat=self.graph.edge_feat_info())
self.train_inputs = fl.data(
'train_inputs', shape=[None], dtype='int64')
self.train_labels = fl.data(
'train_labels', shape=[None, 1, 1], dtype='int64')
self.train_types = fl.data(
'train_types', shape=[None, 1], dtype='int64')
self.train_negs = fl.data(
'train_negs', shape=[None, neg_num, 1], dtype='int64')
self.forward()
def forward(self):
"""Build the GATNE net.
"""
param_attr_init = fluid.initializer.Uniform(
low=-1.0, high=1.0, seed=np.random.randint(100))
embed_param_attrs = fluid.ParamAttr(
name='Base_node_embed', initializer=param_attr_init)
# node_embeddings
base_node_embed = fl.embedding(
input=fl.reshape(
self.train_inputs, shape=[-1, 1]),
size=[self.num_nodes, self.embedding_size],
param_attr=embed_param_attrs)
node_features = []
for edge_type in self.edge_types:
param_attr_init = fluid.initializer.Uniform(
low=-1.0, high=1.0, seed=np.random.randint(100))
embed_param_attrs = fluid.ParamAttr(
name='%s_node_embed' % edge_type, initializer=param_attr_init)
features = fl.embedding(
input=self.gw[edge_type].node_feat['index'],
size=[self.num_nodes, self.embedding_u_size],
param_attr=embed_param_attrs)
node_features.append(features)
# mp_output: list of embedding(self.num_nodes, dim)
mp_output = self.message_passing(self.gw, self.edge_types,
node_features)
# U : (num_type[m], num_nodes, dim[s])
node_type_embed = fl.stack(mp_output, axis=0)
# U : (num_nodes, num_type[m], dim[s])
node_type_embed = fl.transpose(node_type_embed, perm=[1, 0, 2])
#gather node_type_embed from train_inputs
node_type_embed = fl.gather(node_type_embed, self.train_inputs)
# M_r
trans_weights = fl.create_parameter(
shape=[
self.edge_type_count, self.embedding_u_size,
self.embedding_size // self.att_head
],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w')
# W_r
trans_weights_s1 = fl.create_parameter(
shape=[self.edge_type_count, self.embedding_u_size, self.dim_a],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w_s1')
# w_r
trans_weights_s2 = fl.create_parameter(
shape=[self.edge_type_count, self.dim_a, self.att_head],
attr=fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)),
dtype='float32',
name='trans_w_s2')
trans_w = fl.gather(trans_weights, self.train_types)
trans_w_s1 = fl.gather(trans_weights_s1, self.train_types)
trans_w_s2 = fl.gather(trans_weights_s2, self.train_types)
attention = self.attention(node_type_embed, trans_w_s1, trans_w_s2)
node_type_embed = fl.matmul(attention, node_type_embed)
node_embed = base_node_embed + fl.reshape(
fl.matmul(node_type_embed, trans_w), [-1, self.embedding_size])
self.last_node_embed = fl.l2_normalize(node_embed, axis=1)
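        # Negative-sampling objective: the fused node embedding is scored
        # against the positive context node and `neg_num` sampled negatives,
        # both looked up from the shared 'nce_weight' embedding table.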
nce_weight_initializer = fluid.initializer.TruncatedNormalInitializer(
loc=0.0, scale=1.0 / math.sqrt(self.embedding_size))
nce_weight_attrs = fluid.ParamAttr(
name='nce_weight', initializer=nce_weight_initializer)
weight_pos = fl.embedding(
input=self.train_labels,
size=[self.num_nodes, self.embedding_size],
param_attr=nce_weight_attrs)
weight_neg = fl.embedding(
input=self.train_negs,
size=[self.num_nodes, self.embedding_size],
param_attr=nce_weight_attrs)
tmp_node_embed = fl.unsqueeze(self.last_node_embed, axes=[1])
pos_logits = fl.matmul(
tmp_node_embed, weight_pos, transpose_y=True) # [B, 1, 1]
neg_logits = fl.matmul(
tmp_node_embed, weight_neg, transpose_y=True) # [B, 1, neg_num]
pos_score = fl.squeeze(pos_logits, axes=[1])
pos_score = fl.clip(pos_score, min=-10, max=10)
pos_score = -1.0 * fl.logsigmoid(pos_score)
neg_score = fl.squeeze(neg_logits, axes=[1])
neg_score = fl.clip(neg_score, min=-10, max=10)
neg_score = -1.0 * fl.logsigmoid(-1.0 * neg_score)
neg_score = fl.reduce_sum(neg_score, dim=1, keep_dim=True)
self.loss = fl.reduce_mean(pos_score + neg_score)
def attention(self, node_type_embed, trans_w_s1, trans_w_s2):
"""Calculate attention weights.
"""
attention = fl.tanh(fl.matmul(node_type_embed, trans_w_s1))
attention = fl.matmul(attention, trans_w_s2)
attention = fl.reshape(attention, [-1, self.u_num])
attention = fl.softmax(attention)
attention = fl.reshape(attention, [-1, self.att_head, self.u_num])
return attention
def message_passing(self, gw, edge_types, features, name=''):
"""Message passing from source nodes to dstination nodes
"""
def __message(src_feat, dst_feat, edge_feat):
"""send function
"""
return src_feat['h']
def __reduce(feat):
"""recv function
"""
return fluid.layers.sequence_pool(feat, pool_type='average')
if not isinstance(edge_types, list):
edge_types = [edge_types]
if not isinstance(features, list):
features = [features]
assert len(edge_types) == len(features)
output = []
for i in range(len(edge_types)):
msg = gw[edge_types[i]].send(
__message, nfeat_list=[('h', features[i])])
neigh_feat = gw[edge_types[i]].recv(msg, __reduce)
neigh_feat = fl.fc(neigh_feat,
size=neigh_feat.shape[-1],
name='neigh_fc_%d' % (i),
act='sigmoid')
slf_feat = fl.fc(features[i],
size=neigh_feat.shape[-1],
name='slf_fc_%d' % (i),
act='sigmoid')
out = fluid.layers.concat([slf_feat, neigh_feat], axis=1)
out = fl.fc(out, size=neigh_feat.shape[-1], name='fc', act=None)
out = fluid.layers.l2_normalize(out, axis=1)
output.append(out)
# list of matrix
return output
def backward(self, global_steps, opt_config):
"""Build the optimizer.
"""
self.lr = fl.polynomial_decay(opt_config['lr'], global_steps, 0.001)
adam = fluid.optimizer.Adam(learning_rate=self.lr)
adam.minimize(self.loss)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements a class for model configuration.
"""
import datetime
import os
import yaml
import random
import shutil
class Config(object):
"""Implementation of Config class for model configure.
Args:
config_file(str): configure filename, which is a yaml file.
        isCreate(bool): if true, create the necessary directories for saving models, log files and other outputs.
        isSave(bool): if true, save the config file in order to record the configuration.
"""
def __init__(self, config_file, isCreate=False, isSave=False):
self.config_file = config_file
self.config = self.get_config_from_yaml(config_file)
if isCreate:
self.create_necessary_dirs()
if isSave:
self.save_config_file()
def get_config_from_yaml(self, yaml_file):
"""Get the configure hyperparameters from yaml file.
"""
try:
with open(yaml_file, 'r') as f:
config = yaml.load(f)
except Exception:
raise IOError("Error in parsing config file '%s'" % yaml_file)
return config
def create_necessary_dirs(self):
"""Create some necessary directories to save some important files.
"""
time_stamp = datetime.datetime.now().strftime('%m%d_%H%M')
self.config['trainer']['args']['log_dir'] = ''.join(
(self.config['trainer']['args']['log_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['save_dir'] = ''.join(
(self.config['trainer']['args']['save_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['output_dir'] = ''.join(
(self.config['trainer']['args']['output_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
# if os.path.exists(self.config['trainer']['args']['save_dir']):
# input('save_dir is existed, do you really want to continue?')
self.make_dir(self.config['trainer']['args']['log_dir'])
self.make_dir(self.config['trainer']['args']['save_dir'])
self.make_dir(self.config['trainer']['args']['output_dir'])
def save_config_file(self):
"""Save config file so that we can know the config when we look back
"""
filename = self.config_file.split('/')[-1]
targetpath = self.config['trainer']['args']['save_dir']
shutil.copyfile(self.config_file, targetpath + filename)
def make_dir(self, path):
"""Build directory"""
if not os.path.exists(path):
os.makedirs(path)
def __getitem__(self, key):
"""Return the configure dict"""
return self.config[key]
def __call__(self):
"""__call__"""
return self.config
# DGI: Deep Graph Infomax
[Deep Graph Infomax (DGI)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs, both derived using established graph convolutional network architectures.
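Concretely, given patch representations `h_i` from the original graph, corrupted representations `h~_j` from a shuffled (negative) graph, and a graph-level summary vector `s`, DGI maximizes a binary cross-entropy style objective of the following form. This is a sketch following the paper; the implementation below uses a sigmoid-of-mean readout and a bilinear discriminator `D`.
```
\mathcal{J} = \frac{1}{N+M}\left(\sum_{i=1}^{N}\mathbb{E}\big[\log \mathcal{D}(\vec{h}_i,\vec{s})\big]
            + \sum_{j=1}^{M}\mathbb{E}\big[\log\big(1-\mathcal{D}(\tilde{\vec{h}}_j,\vec{s})\big)\big]\right)
```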
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain an embedding for each node. Then we freeze the embeddings and train a node classifier on top of them.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For example, to pretrain with DGI and then train the node classifier on the cora dataset using GPU:
```
python dgi.py --dataset cora --use_cuda
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if --use_cuda is set.
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
DGI Pretrain
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load dataset"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def save_param(dirname, var_name_list):
"""save_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
pos_gw = pgl.graph_wrapper.GraphWrapper(
name="pos_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
neg_gw = pgl.graph_wrapper.GraphWrapper(
name="neg_graph",
place=place,
node_feat=dataset.graph.node_feat_info())
positive_feat = pgl.layers.gcn(pos_gw,
pos_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=pos_gw.node_feat['norm'],
name="gcn_layer_1")
negative_feat = pgl.layers.gcn(neg_gw,
neg_gw.node_feat["words"],
hidden_size,
activation="relu",
norm=neg_gw.node_feat['norm'],
name="gcn_layer_1")
summary_feat = fluid.layers.sigmoid(
fluid.layers.reduce_mean(
positive_feat, [0], keep_dim=True))
summary_feat = fluid.layers.fc(summary_feat,
hidden_size,
bias_attr=False,
name="discriminator")
pos_logits = fluid.layers.matmul(
positive_feat, summary_feat, transpose_y=True)
neg_logits = fluid.layers.matmul(
negative_feat, summary_feat, transpose_y=True)
pos_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=pos_logits,
label=fluid.layers.ones(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
neg_loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=neg_logits,
label=fluid.layers.zeros(
shape=[dataset.graph.num_nodes, 1], dtype="float32"))
loss = fluid.layers.reduce_mean(pos_loss) + fluid.layers.reduce_mean(
neg_loss)
adam = fluid.optimizer.Adam(learning_rate=1e-3)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
best_loss = 1e9
dur = []
for epoch in range(args.epoch):
feed_dict = pos_gw.to_feed(dataset.graph)
node_feat = dataset.graph.node_feat["words"].copy()
perm = np.arange(0, dataset.graph.num_nodes)
np.random.shuffle(perm)
dataset.graph.node_feat["words"] = dataset.graph.node_feat["words"][
perm]
feed_dict.update(neg_gw.to_feed(dataset.graph))
dataset.graph.node_feat["words"] = node_feat
if epoch >= 3:
t0 = time.time()
train_loss = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
if train_loss[0] < best_loss:
best_loss = train_loss[0]
save_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss[0])
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='DGI pretrain')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument(
"--epoch", type=int, default=200, help="pretrain epochs")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Train
"""
import os
import pgl
from pgl import data_loader
from pgl.utils.logger import log
import paddle.fluid as fluid
import numpy as np
import time
import argparse
def load(name):
"""Load"""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def main(args):
"""main"""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
hidden_size = 512
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
name="graph",
place=place,
node_feat=dataset.graph.node_feat_info())
output = pgl.layers.gcn(gw,
gw.node_feat["words"],
hidden_size,
activation="relu",
norm=gw.node_feat['norm'],
name="gcn_layer_1")
output.stop_gradient = True
output = fluid.layers.fc(output,
dataset.num_classes,
act=None,
name="classifier")
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
shape=[None, 1],
dtype="int64",
append_batch_size=False)
pred = fluid.layers.gather(output, node_index)
loss, pred = fluid.layers.softmax_with_cross_entropy(
logits=pred, label=node_label, return_softmax=True)
acc = fluid.layers.accuracy(input=pred, label=node_label, k=1)
loss = fluid.layers.mean(loss)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=1e-2)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
load_param(args.checkpoint, ["gcn_layer_1", "gcn_layer_1_bias"])
feed_dict = gw.to_feed(dataset.graph)
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
dur = []
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(dur) +
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
fetch_list=[loss, acc],
return_numpy=True)
log.info("Accuracy: %f" % test_acc)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='GCN')
parser.add_argument(
"--dataset", type=str, default="cora", help="dataset (cora, pubmed)")
parser.add_argument(
"--checkpoint", type=str, default="best_model", help="checkpoint")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
args = parser.parse_args()
log.info(args)
main(args)
# Distributed Deepwalk in PGL
[Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the distributed deepwalk algorithm and reach the same level of performance as reported in the paper.
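The training objective implemented in `model.py` is skip-gram with negative sampling: for a center node with embedding `u`, a context node `v` drawn from a random-walk window, and `neg_num = k` sampled negative nodes, the model minimizes (up to batch averaging and a constant rescaling) the loss sketched below.
```
\mathcal{L}(u, v) = -\,k\,\log\sigma\big(\vec{u}^{\top}\vec{v}\big)
                    - \sum_{i=1}^{k}\log\big(1-\sigma\big(\vec{u}^{\top}\vec{v}'_i\big)\big)
```
Here `sigma` is the sigmoid function, and the positive term is weighted by `k` so that it balances the `k` negative samples, matching the weighting used in `DeepwalkModel.forward`.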
## Datasets
The dataset used here is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```pgl_deepwalk.cfg``` is the config file for the deepwalk hyperparameters, and ```local_config``` is the config file for the parameter servers. By default, we have 2 pservers and 2 trainers. You can use ```cloud_run.sh``` to start the parameter servers and model trainers.
For example, to train deepwalk in distributed mode on the BlogCatalog dataset:
```sh
# train deepwalk in distributed mode.
sh cloud_run.sh
# multiclass task example
python3 multi_class.py --use_cuda --ckpt_path ./model_path/4029 --epoch 1000
```
## Hyperparameters
- dataset: The dataset name, e.g. "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epoch.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|distributed deepwalk|multi-label classification|MacroF1|0.233|0.211
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
source ./local_config
unset http_proxy https_proxy
# build train_data
trainer_num=`echo $PADDLE_PORT | awk -F',' '{print NF}'`
rm -rf train_data && mkdir -p train_data
cd train_data
if [[ $build_train_data == True ]];then
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/trainer_num/CPU_NUM+1))
else
for i in `seq 1 $trainer_num`;do
touch $i
done
fi
cd -
# mkdir workspace
if [ -d ${BASE} ]; then
rm -rf ${BASE}
fi
mkdir ${BASE}
# start ps
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $BASE
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $BASE/pserver.$i.log &
done
sleep 5s
# start trainers
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $BASE/worker.$j.log &
done
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
from multiprocessing import Process
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from pgl import data_loader
from reader import DeepwalkReader
from model import DeepwalkModel
from utils import get_file_list
from utils import build_graph
from utils import build_fake_graph
from utils import build_gen_func
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
#create the DistributeTranspiler configure
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_complied_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(
loss_name=model_loss.name)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
if trainer_id == 0 or args.is_distributed:
model_save_dir = args.save_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def test(args):
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
start = time.time()
num = 10
for idx, _ in enumerate(gen_func()):
if idx % num == num - 1:
log.info("%s" % (1.0 * (time.time() - start) / num))
start = time.time()
def walk(args):
graph = build_graph(args.num_nodes, args.edge_path)
num_sample_workers = args.num_sample_workers
if args.train_files is None or args.train_files == "None":
log.info("Walking from graph...")
train_files = [None for _ in range(num_sample_workers)]
else:
log.info("Walking from train_data...")
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
def walk_to_file(walk_gen, filename, max_num):
with open(filename, "w") as outf:
num = 0
for walks in walk_gen:
for walk in walks:
outf.write("%s\n" % "\t".join([str(i) for i in walk]))
num += 1
if num % 1000 == 0:
log.info("Total: %s, %s walkpath is saved. " %
(max_num, num))
if num == max_num:
return
m_args = [(DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=None,
train_files=train_files[i]).walk_generator(),
"%s/%s" % (args.walkpath_files, i),
args.epoch * args.num_nodes // args.num_sample_workers)
for i in range(num_sample_workers)]
ps = []
for i in range(num_sample_workers):
p = Process(target=walk_to_file, args=m_args[i])
p.start()
ps.append(p)
for i in range(num_sample_workers):
ps[i].join()
def train(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
args.is_sparse, args.is_distributed, 1.)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(1. * args.num_nodes * args.epoch /
args.batch_size / num_devices / worker_num)
log.info("Train step: %s" % train_steps)
if args.optimizer == "sgd":
args.lr *= args.batch_size * args.walk_len * args.win_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
#just the worker, load the sample
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
elif args.dataset == "ArXiv":
graph = data_loader.ArXivDataset().graph
else:
            raise ValueError(args.dataset + " dataset doesn't exist")
        log.info("Load built-in dataset %s done." % args.dataset)
elif args.walkpath_files is None or args.walkpath_files == "None":
graph = build_graph(args.num_nodes, args.edge_path)
log.info("Load graph from '%s' done." % args.edge_path)
else:
graph = build_fake_graph(args.num_nodes)
log.info("Load fake graph done.")
# bind gen
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
compiled_prog = build_complied_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
def str2bool(v):
if isinstance(v, bool):
return v
if v.lower() in ('yes', 'true', 't', 'y', '1'):
return True
elif v.lower() in ('no', 'false', 'f', 'n', '0'):
return False
else:
raise argparse.ArgumentTypeError('Boolean value expected.')
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument(
"--hidden_size",
type=int,
default=64,
help="Hidden size of the embedding.")
parser.add_argument(
"--lr", type=float, default=0.025, help="Learning rate.")
parser.add_argument(
"--neg_num", type=int, default=5, help="Number of negative samples.")
parser.add_argument(
"--epoch", type=int, default=1, help="Number of training epoch.")
parser.add_argument(
"--batch_size",
type=int,
default=128,
help="Numbert of walk paths in a batch.")
parser.add_argument(
"--walk_len", type=int, default=40, help="Length of a walk path.")
parser.add_argument(
"--win_size", type=int, default=5, help="Window size in skip-gram.")
parser.add_argument(
"--save_path",
type=str,
default="model_path",
help="Output path for saving model.")
parser.add_argument(
"--num_sample_workers",
type=int,
default=1,
help="Number of sampling workers.")
parser.add_argument(
"--steps_per_save",
type=int,
default=3000,
help="Steps for model saveing.")
parser.add_argument(
"--num_nodes",
type=int,
default=10000,
help="Number of nodes in graph.")
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--train_files", type=str, default=None)
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--is_distributed", type=str2bool, default=False)
parser.add_argument("--is_sparse", type=str2bool, default=True)
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--mode",
type=str,
required=False,
choices=['train', 'walk'],
default="train")
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="sgd")
args = parser.parse_args()
log.info(args)
if args.mode == "train":
train(args)
elif args.mode == "walk":
walk(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from model import DeepwalkModel
from utils import build_graph
from utils import build_gen_func
def optimization(base_lr, loss, train_steps, optimizer='adam'):
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
# dont use gpu's lazy mode
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def get_parallel_exe(program, loss):
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = 1 #2 for fp32 4 for fp16
exec_strategy.use_experimental_executor = True
    exec_strategy.num_iteration_per_drop_scope = 1  # important: drop scope every iteration
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step == train_steps or
step % args.steps_per_save == 0) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def main(args):
import logging
log.setLevel(logging.DEBUG)
log.info("start")
num_devices = len(F.cuda_places())
model = DeepwalkModel(args.num_nodes, args.hidden_size, args.neg_num,
False, False, 1.)
pyreader = model.pyreader
loss = model.forward()
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
graph = build_graph(args.num_nodes, args.edge_path)
gen_func = build_gen_func(args, graph)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
if args.warm_start_from_dir is not None:
F.io.load_params(exe, args.warm_start_from_dir, train_prog)
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--warm_start_from_dir", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
#!/bin/bash
set -x
source ./pgl_deepwalk.cfg
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=1
if [[ $build_train_data == True ]];then
train_files="./train_data"
else
train_files="None"
fi
if [[ $pre_walk == True ]]; then
walkpath_files="./walk_path"
if [[ $TRAINING_ROLE == "PSERVER" ]];then
while [[ ! -d train_data ]];do
sleep 60
echo "Waiting for train_data ..."
done
rm -rf $walkpath_files && mkdir -p $walkpath_files
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --mode walk --walkpath_files $walkpath_files --epoch $epoch \
--walk_len $walk_len --batch_size $batch_size --train_files $train_files --dataset "BlogCatalog"
touch build_graph_done
fi
while [[ ! -f build_graph_done ]];do
sleep 60
echo "Waiting for walk_path ..."
done
else
walkpath_files="None"
fi
python -u cluster_train.py --num_sample_workers $num_sample_workers --num_nodes $num_nodes --optimizer $optimizer --walkpath_files $walkpath_files --epoch $epoch \
--is_distributed $distributed_embedding --lr $learning_rate --neg_num $neg_num --walk_len $walk_len --win_size $win_size --is_sparse $is_sparse --hidden_size $dim \
--batch_size $batch_size --steps_per_save $steps_per_save --train_files $train_files --dataset "BlogCatalog"
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
export BASE="./local_dir"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Deepwalk model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class DeepwalkModel(object):
def __init__(self,
num_nodes,
hidden_size=16,
neg_num=5,
is_sparse=False,
is_distributed=False,
embedding_lr=1.0):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, neg_num + 1, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.neg_num = neg_num
self.embed_init = F.initializer.Uniform(
low=-1. / math.sqrt(hidden_size), high=1. / math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.is_distributed = is_distributed
self.hidden_size = hidden_size
self.loss = None
self.embedding_lr = embedding_lr
max_hidden_size = int(math.pow(2, 31) / 4 / num_nodes)
self.num_part = int(math.ceil(1. * hidden_size / max_hidden_size))
def forward(self):
src, dsts = L.read_file(self.pyreader)
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dsts = L.reshape(dsts, [-1, 1])
if self.num_part is not None and self.num_part != 1 and not self.is_distributed:
src_embed = split_embedding(
src,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
dsts_embed = split_embedding(
dsts,
self.num_nodes,
self.hidden_size,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
else:
src_embed = L.embedding(
src, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
dsts_embed = L.embedding(
dsts, (self.num_nodes, self.hidden_size),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
if self.is_sparse:
# reshape back
src_embed = L.reshape(src_embed, [-1, 1, self.hidden_size])
dsts_embed = L.reshape(dsts_embed,
[-1, self.neg_num + 1, self.hidden_size])
logits = L.matmul(
src_embed, dsts_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
    multiprocess_reader uses Python multiprocessing to read data from the
    given readers and then merges all data through a multiprocessing.Queue
    or multiprocessing.Pipe. The number of processes equals the number of
    input readers; each process runs one reader.
    Note that multiprocessing.Queue requires read/write access to /dev/shm,
    which some platforms do not provide.
    You need to create the readers first, and they should be independent of
    each other so that each process can work on its own.
    An example:
    .. code-block:: python
        reader0 = reader(["file01", "file02"])
        reader1 = reader(["file11", "file12"])
        reader2 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
                sample = deserialize_data(buff)
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import math
import os
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as l
np.random.seed(123)
def load(name):
if name == 'BlogCatalog':
dataset = data_loader.BlogCatalogDataset()
else:
        raise ValueError(name + " dataset doesn't exist")
return dataset
def node_classify_model(graph,
num_labels,
hidden_size=16,
name='node_classify_task'):
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes,
size=[graph.num_nodes, hidden_size],
param_attr=fluid.ParamAttr(name='weight'))
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
#labels = (np.random.rand(512, 39) > 0.95).astype(np.float32)
def batch_nodes_generator(shuffle=shuffle):
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
    assert topk_list is not None or threshold is not None, "either topk_list or threshold must be provided"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
def main(args):
hidden_size = args.hidden_size
epoch = args.epoch
ckpt_path = args.ckpt_path
threshold = args.threshold
dataset = load(args.dataset)
if args.batch_size is None:
batch_size = len(dataset.train_index)
else:
batch_size = args.batch_size
train_steps = (len(dataset.train_index) // batch_size) * epoch
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_pyreader, train_loss, train_probs, train_labels, train_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='train')
lr = l.polynomial_decay(0.025, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_loss)
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_pyreader, test_loss, test_probs, test_labels, test_topk = node_classify_model(
dataset.graph,
dataset.num_groups,
hidden_size=hidden_size,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=batch_size,
epoch=epoch))
test_pyreader.decorate_tensor_provider(
node_classify_generator(
dataset.graph, dataset.test_index, batch_size=batch_size, epoch=1))
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(ckpt_path, var.name))
fluid.io.load_vars(
exe, ckpt_path, main_program=train_prog, predicate=existed_params)
step = 0
prev_time = time.time()
train_pyreader.start()
while 1:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_loss, train_probs, train_labels, train_topk
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro", threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro", threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_pyreader.reset()
break
test_pyreader.start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_loss, test_probs, test_labels, test_topk
],
return_numpy=True)
            test_probs_vals.append(test_probs_val)
            test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_pyreader.reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
break
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='node2vec')
parser.add_argument(
"--dataset",
type=str,
default="BlogCatalog",
help="dataset (BlogCatalog)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--epoch", type=int, default=400)
parser.add_argument("--batch_size", type=int, default=None)
parser.add_argument("--threshold", type=float, default=0.3)
parser.add_argument(
"--ckpt_path",
type=str,
default="./tmp/baseline_node2vec/paddle_model")
args = parser.parse_args()
log.info(args)
main(args)
# deepwalk config
num_nodes=10312 # max node_id + 1
num_sample_workers=2
epoch=100
optimizer=sgd # sgd or adam
learning_rate=0.5
neg_num=5
walk_len=40
win_size=5
dim=128
batch_size=8
steps_per_save=5000
is_sparse=False
distributed_embedding=False # only used when num_nodes > 100,000,000; slower than normal embedding
build_train_data=True
pre_walk=False
CPU_NUM=16
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
from pgl.utils import mp_reader
class DeepwalkReader(object):
def __init__(self,
graph,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
            walkpath_files: if not None, walks are read from these files instead of being sampled from the graph
"""
self.graph = graph
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
#walk = [hash_map[x] for x in line.strip('\n\t').split('\t')]
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
def node_generator():
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
try:
src_list, pos_list = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
                    src_list.append(s[:max_len])
                    pos_list.append(p[:max_len])
src = [s for x in src_list for s in x]
pos = [s for x in pos_list for s in x]
                src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos,
[-1, 1, 1])
neg_sample_size = [len(pos), self.neg_num, 1]
if src.shape[0] == 0:
continue
if self.neg_sample_type == "average":
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
elif self.neg_sample_type == "inbatch":
pass
dst = np.concatenate([pos, negs], 1)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
yield src[:max_len], dst[:max_len]
except Exception as e:
log.exception(e)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import numpy as np
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from reader import DeepwalkReader
import mp_reader
def get_file_list(path):
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path):
filelist = []
if os.path.isfile(edge_path):
filelist = [edge_path]
elif os.path.isdir(edge_path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(edge_path) for f in filenames
]
else:
raise ValueError(edge_path + " not supported")
edges, edge_weight = [], []
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
edges.append([slots[1], slots[0]])
if len(slots) > 2:
edge_weight.extend([float(slots[2]), float(slots[2])])
edges = np.array(edges, dtype="int64")
    assert num_nodes > edges.max(
    ), "Node ids in the edge list should be smaller than num_nodes!"
edge_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight)
graph = Graph(num_nodes, edges, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
del edges, edge_feat
log.info("Build graph index done")
if "weight" in graph.edge_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info("Build graph alias sample table done")
return graph
def build_fake_graph(num_nodes):
class FakeGraph():
pass
graph = FakeGraph()
graph.num_nodes = num_nodes
return graph
def build_gen_func(args, graph):
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None or args.walkpath_files == "None":
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None or args.train_files == "None":
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
DeepwalkReader(
graph,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def test_gen_speed(gen_func):
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
# GraphSAGE for Large-Scale Networks
# Distributed GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of accuracy as reported in the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
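The example's ```model.py``` (listed further below) builds each aggregator from PGL's ```send```/```recv``` primitives. As a quick orientation, a minimal mean-aggregator layer looks roughly like the sketch below, assuming ```gw``` is a ```pgl.graph_wrapper.GraphWrapper``` over the sampled subgraph and ```feature``` is its node feature tensor; this is a simplified sketch, not the exact code used in the example.
```python
import paddle.fluid as fluid


def copy_send(src_feat, dst_feat, edge_feat):
    # forward the source node feature "h" along every edge
    return src_feat["h"]


def mean_recv(feat):
    # average the incoming messages for each destination node
    return fluid.layers.sequence_pool(feat, pool_type="average")


def graphsage_mean_layer(gw, feature, hidden_size, name):
    # gw: pgl.graph_wrapper.GraphWrapper of the sampled subgraph (assumption)
    msg = gw.send(copy_send, nfeat_list=[("h", feature)])
    neigh_feature = gw.recv(msg, mean_recv)
    self_feature = fluid.layers.fc(feature, hidden_size, act="relu", name=name + "_l")
    neigh_feature = fluid.layers.fc(neigh_feature, hidden_size, act="relu", name=name + "_r")
    return fluid.layers.concat([self_feature, neigh_feature], axis=1)
```
The full versions in ```model.py``` additionally L2-normalize the concatenated output and provide max-pool, mean-pool and LSTM variants.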
### Datasets
The reddit dataset should be downloaded from the following links and placed in the ```./data``` directory. The details for the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
For scalability, we use redis as the distributed graph storage solution and train GraphSAGE against a redis server; a short connection sketch follows the quickstart block below.
- reddit.npz https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
### Datasets (Quickstart)
The reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J). The details for the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, the reddit dataset has been preprocessed and packed into a docker image, which can be pulled with the following command.
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
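For reference, the training reader of this example (```reader.py```, listed further below) connects to that redis graph service roughly as in the sketch below; the port ```7430``` and shard count ```64``` are the defaults used there, and the service started in the "How to run" section must already be up.
```python
import socket

from pgl import redis_graph

# Connect to the redis-backed graph service (assumed to be running locally on port 7430).
redis_configs = [{
    "host": socket.gethostbyname(socket.gethostname()),
    "port": 7430,
}]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)

# Sample up to 25 predecessors of a few seed nodes, as done during subgraph sampling.
pred, pred_eid = graph.sample_predecessor([0, 1, 2], max_degree=25, return_eids=True)
```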
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
- scipy
- redis==2.10.6
- redis-py-cluster==1.3.6
```
### How to run
To train a GraphSAGE model on the Reddit dataset, follow the two steps below.
#### 1. Start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
```
#### 2. Train the GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_workers 10
```
#### Hyperparameters
The following flags are accepted by ```train.py``` (an example invocation is shown after this list).
- epoch: Number of training epochs (default: 10).
- use_cuda: Use GPU if this flag is set.
- graphsage_type: We support 4 aggregator types including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- normalize: Normalize the input features if this flag is set.
- sample_workers: The number of workers for multiprocess subgraph sampling.
- lr: Learning rate.
- symmetry: Make the edges symmetric (i.e., treat the graph as undirected) if this flag is set.
- batch_size: Batch size.
- samples_1: The max neighbors for the first hop neighbor sampling. (default: 25)
- samples_2: The max neighbors for the second hop neighbor sampling. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
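For instance, the invocation below trains a two-hop max-pool model; all flags are defined in ```train.py``` (listed further below), and the values shown are simply the documented defaults with an explicit aggregator choice.
```sh
python train.py --use_cuda \
    --graphsage_type graphsage_maxpool \
    --epoch 10 \
    --sample_workers 10 \
    --lr 0.01 \
    --batch_size 128 \
    --samples_1 25 --samples_2 10 \
    --hidden_size 128
```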
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Aggregator | Accuracy | Reported in paper |
| --- | --- | --- |
| Mean | 95.70% | 95.0% |
| Meanpool | 95.60% | 94.8% |
| Maxpool | 94.95% | 94.8% |
| LSTM | 95.13% | 95.4% |
### View the Code
See the code [here](graphsage_examples_code.html).
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
return src_feat["h"]
def mean_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import socket
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
from pgl import redis_graph
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
while start < len(nodes):
index = perm[start:start + batch_size]
start += batch_size
yield nodes[index], node_label[index]
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def worker(batch_info, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
redis_configs = [{
"host": socket.gethostbyname(socket.gethostname()),
"port": 7430
}, ]
graph = redis_graph.RedisGraph("sub_graph", redis_configs, 64)
first = True
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
eid2edges = {}
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
for _dst, _srcs, _eids in zip(start_nodes, pred, pred_eid):
for _src, _eid in zip(_srcs, _eids):
eid2edges[_eid] = (_src, _dst)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(nodes=nodes, eid=eids, edges=[ eid2edges[e] for e in eids])
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
yield feed_dict
return work
def multiprocess_graph_reader(
graph_wrapper,
samples,
node_index,
batch_size,
node_label,
num_workers=4):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
"""parse_to_subgraph
"""
def work():
"""work
"""
            for feed_dict in rd():
                yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)],
graph_wrapper, samples))
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
return reader()
scipy
redis==2.10.6
redis-py-cluster==1.3.6
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data():
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
reddit_index_label is preprocess from reddit.npz without feats key.
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit_index_label.npz"))
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
return {
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
#feature = fluid.layers.gather(feature, graph_wrapper.node_feat['feats'])
feature = graph_wrapper.node_feat['feats']
feature.stop_gradient = True
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
data = load_data()
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", fluid.CPUPlace(), node_feat=[('feats', [None, 602], np.dtype('float32'))])
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
train_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=1,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Distributed metapath2vec, metapath2vec++, multi-metapath2vec++ in PGL
[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithmic framework for representation learning in heterogeneous networks, which contain multiple types of nodes and links. Given a heterogeneous graph, metapath2vec first generates meta-path-based random walks and then trains a skip-gram model on them. Based on PGL, we reproduce the metapath2vec algorithm in distributed mode.
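As a toy illustration (not the implementation used in this example, which lives in ```walker.py``` and ```pgl.sample```), a walk along a scheme such as ```c2p-p2a-a2p-p2c``` cycles through the node types of the scheme and only moves to neighbors of the required type; ```successors(node, node_type)``` below is a hypothetical helper.
```python
import random


def metapath_walk(successors, start_node, meta_path, walk_len):
    """Toy sketch of a meta-path-guided random walk.

    `successors(node, node_type)` is a hypothetical helper returning the
    neighbors of `node` whose type is `node_type`. The scheme
    "c2p-p2a-a2p-p2c" corresponds to meta_path = ["c", "p", "a", "p", "c"].
    """
    walk = [start_node]
    for i in range(walk_len - 1):
        # the scheme starts and ends with the same type, so it repeats with period len - 1
        next_type = meta_path[i % (len(meta_path) - 1) + 1]
        candidates = successors(walk[-1], next_type)
        if not candidates:
            break
        walk.append(random.choice(candidates))
    return walk
```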
### Datasets
DBLP: The dataset contains 14376 papers (P), 20 conferences (C), 14475 authors (A), and 8920 terms (T). There are 37791 nodes in this dataset.
You can download the dataset from [here](https://github.com/librahu/HIN-Datasets-for-Recommendation-and-Network-Embedding)
We use the ```DBLP``` dataset as an example. After downloading the dataset, place the files in, say, ```./data/DBLP/```.
### Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
### How to run
Before training, run the command below to preprocess the data.
```sh
python data_process.py --data_path ./data/DBLP --output_path ./data/data_processed
```
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```config.yaml``` is the configuration file for the metapath2vec hyperparameters, and ```local_config``` configures the PaddlePaddle parameter servers. By default, we use 2 parameter servers and 2 trainers. ```cloud_run.sh``` helps start up the parameter servers and trainers.
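In this example ```local_config``` boils down to a handful of environment variables describing that default topology (the sketch below mirrors the export block shipped with this example); for a real cluster, replace the IPs and ports accordingly.
```sh
# local_config (sketch): 2 trainers and 2 parameter servers, all on localhost
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
```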
For example, to train metapath2vec in distributed mode on the DBLP dataset:
```sh
# train metapath2vec in distributed mode.
sh cloud_run.sh
# multiclass task example
python multi_class.py --dataset ./data/data_processed/author_label.txt --ckpt_path ./checkpoints/2000 --num_nodes 33791
```
### Model Selection
There are three models in this example: ```metapath2vec```, ```metapath2vec++``` and ```multi_metapath2vec++```. You can select among them by modifying ```config.yaml```.
To run the ```metapath2vec++``` model, simply set the hyperparameter **neg_sample_type** to **m2v_plus**.
```multi-metapath2vec++``` means that instead of a single metapath, several metapaths are used simultaneously to train the model. For example, you might want to use ```c2p-p2a-a2p-p2c``` and ```p2a-a2p``` at the same time. In that case, set the hyperparameters below in ```config.yaml``` (see the excerpt after this list).
- **neg_sample_type**: "m2v_plus"
- **walk_mode**: "multi_m2v"
- **meta_path**: "c2p-p2a-a2p-p2c;p2a-a2p"
- **first_node_type**: "c;p"
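Put together, the relevant part of ```config.yaml``` would read:
```yaml
# multi-metapath2vec++ settings (excerpt of config.yaml)
neg_sample_type: "m2v_plus"
walk_mode: "multi_m2v"
meta_path: "c2p-p2a-a2p-p2c;p2a-a2p"
first_node_type: "c;p"
```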
### Hyperparameters
All hyperparameters are stored in the ```config.yaml``` file, so before training you can edit ```config.yaml``` to adjust them as you like.
Some important hyperparameters in ```config.yaml```:
- **edge_path**: the directory of graph data that you want to load
- **lr**: learning rate
- **neg_num**: number of negative samples.
- **num_walks**: number of walks started from each node
- **walk_len**: walk length
- **meta_path**: meta path scheme
#!/bin/bash
set -x
mode=${1}
source ./utils.sh
unset http_proxy https_proxy
source ./local_config
if [ ! -d ${log_dir} ]; then
mkdir ${log_dir}
fi
for((i=0;i<${PADDLE_PSERVERS_NUM};i++))
do
echo "start ps server: ${i}"
echo $log_dir
TRAINING_ROLE="PSERVER" PADDLE_TRAINER_ID=${i} sh job.sh &> $log_dir/pserver.$i.log &
done
sleep 10s
for((j=0;j<${PADDLE_TRAINERS_NUM};j++))
do
echo "start ps work: ${j}"
TRAINING_ROLE="TRAINER" PADDLE_TRAINER_ID=${j} sh job.sh &> $log_dir/worker.$j.log &
done
tail -f $log_dir/worker.0.log
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import time
import os
import math
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from paddle.fluid.incubate.fleet.parameter_server.distribute_transpiler import fleet
from paddle.fluid.transpiler.distribute_transpiler import DistributeTranspilerConfig
import paddle.fluid.incubate.fleet.base.role_maker as role_maker
from pgl.utils.logger import log
from model import Metapath2vecModel
from graph import m2vGraph
from utils import load_config
from walker import multiprocess_data_generator
def init_role():
# reset the place according to role of parameter server
training_role = os.getenv("TRAINING_ROLE", "TRAINER")
paddle_role = role_maker.Role.WORKER
place = F.CPUPlace()
if training_role == "PSERVER":
paddle_role = role_maker.Role.SERVER
# set the fleet runtime environment according to configure
ports = os.getenv("PADDLE_PORT", "6174").split(",")
pserver_ips = os.getenv("PADDLE_PSERVERS").split(",") # ip,ip...
eplist = []
if len(ports) > 1:
# local debug mode, multi port
for port in ports:
eplist.append(':'.join([pserver_ips[0], port]))
else:
# distributed mode, multi ip
for ip in pserver_ips:
eplist.append(':'.join([ip, ports[0]]))
pserver_endpoints = eplist # ip:port,ip:port...
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
role = role_maker.UserDefinedRoleMaker(
current_id=trainer_id,
role=paddle_role,
worker_num=worker_num,
server_endpoints=pserver_endpoints)
fleet.init(role)
def optimization(base_lr, loss, train_steps, optimizer='sgd'):
decayed_lr = L.learning_rate_scheduler.polynomial_decay(
learning_rate=base_lr,
decay_steps=train_steps,
end_learning_rate=0.0001 * base_lr,
power=1.0,
cycle=False)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(decayed_lr)
elif optimizer == 'adam':
optimizer = F.optimizer.Adam(decayed_lr, lazy_mode=True)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
#create the DistributeTranspiler configure
config = DistributeTranspilerConfig()
config.sync_mode = False
#config.runtime_split_send_recv = False
config.slice_var_up = False
#create the distributed optimizer
optimizer = fleet.distributed_optimizer(optimizer, config)
optimizer.minimize(loss)
def build_complied_prog(train_program, model_loss):
num_threads = int(os.getenv("CPU_NUM", 10))
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", 0))
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = num_threads
#exec_strategy.use_experimental_executor = True
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
#build_strategy.memory_optimize = True
build_strategy.memory_optimize = False
build_strategy.remove_unnecessary_lock = False
if num_threads > 1:
build_strategy.reduce_strategy = F.BuildStrategy.ReduceStrategy.Reduce
compiled_prog = F.compiler.CompiledProgram(
train_program).with_data_parallel(loss_name=model_loss.name)
return compiled_prog
def train_prog(exe, program, loss, node2vec_pyreader, args, train_steps):
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
if not os.path.exists(args.save_path):
os.makedirs(args.save_path)
while True:
try:
begin_time = time.time()
loss_val, = exe.run(program, fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if step % args.steps_per_save == 0 or step == train_steps:
save_path = args.save_path
if trainer_id == 0:
model_path = os.path.join(save_path, "%s" % step)
fleet.save_persistables(exe, model_path)
if step == train_steps:
break
def main(args):
log.info("start")
worker_num = int(os.getenv("PADDLE_TRAINERS_NUM", "0"))
num_devices = int(os.getenv("CPU_NUM", 10))
model = Metapath2vecModel(config=args)
pyreader = model.pyreader
loss = model.forward()
# init fleet
init_role()
train_steps = math.ceil(args.num_nodes * args.epochs / args.batch_size /
num_devices / worker_num)
log.info("Train step: %s" % train_steps)
real_batch_size = args.batch_size * args.walk_len * args.win_size
if args.optimizer == "sgd":
args.lr *= real_batch_size
optimization(args.lr, loss, train_steps, args.optimizer)
# init and run server or worker
if fleet.is_server():
fleet.init_server(args.warm_start_from_dir)
fleet.run_server()
if fleet.is_worker():
log.info("start init worker done")
fleet.init_worker()
#just the worker, load the sample
log.info("init worker done")
exe = F.Executor(F.CPUPlace())
exe.run(fleet.startup_program)
log.info("Startup done")
dataset = m2vGraph(args)
log.info("Build graph done.")
data_generator = multiprocess_data_generator(args, dataset)
cur_time = time.time()
for idx, _ in enumerate(data_generator()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
pyreader.decorate_tensor_provider(data_generator)
pyreader.start()
compiled_prog = build_complied_prog(fleet.main_program, loss)
train_prog(exe, compiled_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='metapath2vec')
parser.add_argument("-c", "--config", type=str, default="./config.yaml")
args = parser.parse_args()
config = load_config(args.config)
log.info(config)
main(config)
# graph data config
edge_path: "./data/data_processed"
edge_files: "p2a:paper_author.txt,p2c:paper_conference.txt,p2t:paper_type.txt"
node_types_file: "node_types.txt"
num_nodes: 37791
symmetry: True
# skipgram pair data config
win_size: 5
neg_num: 5
# average; m2v_plus
neg_sample_type: "average"
# random walk config
# m2v; multi_m2v;
walk_mode: "m2v"
meta_path: "c2p-p2a-a2p-p2c"
first_node_type: "c"
walk_len: 24
batch_size: 4
node_shuffle: True
node_files: null
num_sample_workers: 2
# model config
embed_dim: 64
is_sparse: True
# only used when num_nodes > 100,000,000; slower than normal embedding
is_distributed: False
# training config
epochs: 10
optimizer: "sgd"
lr: 0.1
warm_start_from_dir: null
walkpath_files: "None"
train_files: "None"
steps_per_save: 1000
save_path: "./checkpoints"
log_dir: "./logs"
CPU_NUM: 16
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Data preprocessing for DBLP dataset"""
import sys
import os
import argparse
import numpy as np
from collections import OrderedDict
AUTHOR = 14475
PAPER = 14376
CONF = 20
TYPE = 8920
LABEL = 4
def build_node_types(meta_node, outfile):
"""build_node_types"""
nt_ori2new = {}
with open(outfile, 'w') as writer:
offset = 0
for node_type, num_nodes in meta_node.items():
ori_id2new_id = {}
for i in range(num_nodes):
writer.write("%d\t%s\n" % (offset + i, node_type))
ori_id2new_id[i + 1] = offset + i
nt_ori2new[node_type] = ori_id2new_id
offset += num_nodes
return nt_ori2new
def remapping_index(args, src_dict, dst_dict, ori_file, new_file):
"""remapping_index"""
ori_file = os.path.join(args.data_path, ori_file)
new_file = os.path.join(args.output_path, new_file)
with open(ori_file, 'r') as reader, open(new_file, 'w') as writer:
for line in reader:
slots = line.strip().split()
s = int(slots[0])
d = int(slots[1])
new_s = src_dict[s]
new_d = dst_dict[d]
writer.write("%d\t%d\n" % (new_s, new_d))
def author_label(args, ori_id2pgl_id, ori_file, real_file, new_file):
"""author_label"""
ori_file = os.path.join(args.data_path, ori_file)
real_file = os.path.join(args.data_path, real_file)
new_file = os.path.join(args.output_path, new_file)
real_id2pgl_id = {}
with open(ori_file, 'r') as reader:
for line in reader:
slots = line.strip().split()
ori_id = int(slots[0])
real_id = int(slots[1])
pgl_id = ori_id2pgl_id[ori_id]
real_id2pgl_id[real_id] = pgl_id
with open(real_file, 'r') as reader, open(new_file, 'w') as writer:
for line in reader:
slots = line.strip().split()
real_id = int(slots[0])
label = int(slots[1])
pgl_id = real_id2pgl_id[real_id]
writer.write("%d\t%d\n" % (pgl_id, label))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='DBLP data preprocessing')
parser.add_argument(
'--data_path',
default=None,
type=str,
help='original data path(default: None)')
parser.add_argument(
'--output_path',
default=None,
type=str,
help='output path(default: None)')
args = parser.parse_args()
meta_node = OrderedDict()
meta_node['a'] = AUTHOR
meta_node['p'] = PAPER
meta_node['c'] = CONF
meta_node['t'] = TYPE
if not os.path.exists(args.output_path):
os.makedirs(args.output_path)
node_types_file = os.path.join(args.output_path, "node_types.txt")
nt_ori2new = build_node_types(meta_node, node_types_file)
remapping_index(args, nt_ori2new['p'], nt_ori2new['a'], 'paper_author.dat',
'paper_author.txt')
remapping_index(args, nt_ori2new['p'], nt_ori2new['c'],
'paper_conference.dat', 'paper_conference.txt')
remapping_index(args, nt_ori2new['p'], nt_ori2new['t'], 'paper_type.dat',
'paper_type.txt')
author_label(args, nt_ori2new['a'], 'author_map_id.dat',
'author_label.dat', 'author_label.txt')
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import sys
import os
import numpy as np
import pickle as pkl
import tqdm
import time
import random
from pgl.utils.logger import log
from pgl import heter_graph
class m2vGraph(object):
"""Implemetation of graph in order to sample metapath random walk.
"""
def __init__(self, config):
self.edge_path = config.edge_path
self.num_nodes = config.num_nodes
self.symmetry = config.symmetry
edge_files = config.edge_files
node_types_file = config.node_types_file
self.edge_file_list = []
for pair in edge_files.split(','):
e_type, filename = pair.split(':')
filename = os.path.join(self.edge_path, filename)
self.edge_file_list.append((e_type, filename))
self.node_types_file = os.path.join(self.edge_path, node_types_file)
self.build_graph()
def build_graph(self):
"""Build pgl heterogeneous graph.
"""
edges_by_types = {}
npy = self.edge_file_list[0][1] + ".npy"
if os.path.exists(npy):
log.info("load data from numpy file")
for pair in self.edge_file_list:
edges_by_types[pair[0]] = np.load(pair[1] + ".npy")
else:
log.info("load data from txt file")
for pair in self.edge_file_list:
edges_by_types[pair[0]] = self.load_edges(pair[1])
# np.save(pair[1] + ".npy", edges_by_types[pair[0]])
for e_type, edges in edges_by_types.items():
log.info(["number of %s edges: " % e_type, len(edges)])
if self.symmetry:
tmp = {}
for key, edges in edges_by_types.items():
n_list = key.split('2')
re_key = n_list[1] + '2' + n_list[0]
tmp[re_key] = edges_by_types[key][:, [1, 0]]
edges_by_types.update(tmp)
log.info(["finished loadding symmetry edges."])
node_types = self.load_node_types(self.node_types_file)
assert len(node_types) == self.num_nodes, \
"num_nodes should be equal to the length of node_types"
log.info(["number of nodes: ", len(node_types)])
node_features = {
'index': np.array([i for i in range(self.num_nodes)]).reshape(
-1, 1).astype(np.int64)
}
self.graph = heter_graph.HeterGraph(
num_nodes=self.num_nodes,
edges=edges_by_types,
node_types=node_types,
node_feat=node_features)
def load_edges(self, file_, symmetry=False):
"""Load edges from file.
"""
edges = []
with open(file_, 'r') as reader:
for line in reader:
items = line.strip().split()
src, dst = int(items[0]), int(items[1])
edges.append((src, dst))
if symmetry:
edges.append((dst, src))
edges = np.array(list(set(edges)), dtype=np.int64)
# edges = list(set(edges))
return edges
def load_node_types(self, file_):
"""Load node types
"""
node_types = []
log.info("node_types_file name: %s" % file_)
with open(file_, 'r') as reader:
for line in reader:
items = line.strip().split()
node_id = int(items[0])
n_type = items[1]
node_types.append((node_id, n_type))
return node_types
#!/bin/bash
set -x
source ./utils.sh
export CPU_NUM=$CPU_NUM
export FLAGS_rpc_deadline=3000000
export FLAGS_communicator_send_queue_size=1
export FLAGS_communicator_min_send_grad_num_before_recv=0
export FLAGS_communicator_max_merge_var_num=1
export FLAGS_communicator_merge_sparse_grad=0
python -u cluster_train.py -c config.yaml
#!/bin/bash
export PADDLE_TRAINERS_NUM=2
export PADDLE_PSERVERS_NUM=2
export PADDLE_PORT=6184,6185
export PADDLE_PSERVERS="127.0.0.1"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
metapath2vec model.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def distributed_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, int(_part_size)),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class Metapath2vecModel(object):
def __init__(self, config, embedding_lr=1.0):
self.config = config
self.neg_num = self.config.neg_num
self.num_nodes = self.config.num_nodes
self.embed_dim = self.config.embed_dim
self.is_sparse = self.config.is_sparse
self.is_distributed = self.config.is_distributed
self.embedding_lr = embedding_lr
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, 1], [-1, self.neg_num + 1, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
bound = 1. / math.sqrt(self.embed_dim)
self.embed_init = F.initializer.Uniform(low=-bound, high=bound)
self.loss = None
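# Cap the width of each embedding slice so a single float32 parameter tensor stays below 2GB (2^31 bytes).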
max_hidden_size = int(math.pow(2, 31) / 4 / self.num_nodes)
self.num_part = int(math.ceil(1. * self.embed_dim / max_hidden_size))
def forward(self):
src, dsts = L.read_file(self.pyreader)
if self.is_sparse:
src = L.reshape(src, [-1, 1])
dsts = L.reshape(dsts, [-1, 1])
if self.num_part is not None and self.num_part != 1 and not self.is_distributed:
src_embed = distributed_embedding(
src,
self.num_nodes,
self.embed_dim,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
dsts_embed = distributed_embedding(
dsts,
self.num_nodes,
self.embed_dim,
self.embed_init,
"weight",
self.num_part,
self.is_sparse,
learning_rate=self.embedding_lr)
else:
src_embed = L.embedding(
src, (self.num_nodes, self.embed_dim),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
dsts_embed = L.embedding(
dsts, (self.num_nodes, self.embed_dim),
self.is_sparse,
self.is_distributed,
param_attr=F.ParamAttr(
name="weight",
learning_rate=self.embedding_lr,
initializer=self.embed_init))
if self.is_sparse:
src_embed = L.reshape(src_embed, [-1, 1, self.embed_dim])
dsts_embed = L.reshape(dsts_embed,
[-1, self.neg_num + 1, self.embed_dim])
logits = L.matmul(
src_embed, dsts_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
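# The single positive sample gets weight neg_num so that positives and negatives contribute equally to the loss.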
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader uses multiple Python processes to read data from the
given readers and merges the samples through a multiprocessing.Queue or
multiprocessing.Pipe. One process is started per reader, so the readers
should be independent of each other.
Note that multiprocessing.Queue requires read/write access to /dev/shm,
which some platforms do not provide.
An example:
.. code-block:: python
    reader0 = reader(["file01", "file02"])
    reader1 = reader(["file11", "file12"])
    reader2 = reader(["file21", "file22"])
    reader = multiprocess_reader([reader0, reader1, reader2],
        queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
now = time.time()
sample = deserialize_data(buff)
out = time.time() - now
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the multi class task for testing the embedding learned by metapath2vec model.
"""
import argparse
import sys
import os
import tqdm
import time
import math
import logging
import random
import pickle as pkl
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
import paddle.fluid as fluid
import paddle.fluid.layers as fl
def load_data(file_):
"""Load data for node classification.
"""
words_label = []
line_count = 0
with open(file_, 'r') as reader:
for line in reader:
line_count += 1
tokens = line.strip().split()
word, label = int(tokens[0]), int(tokens[1]) - 1
words_label.append((word, label))
words_label = np.array(words_label, dtype=np.int64)
np.random.shuffle(words_label)
logging.info('%d/%d word_label pairs have been loaded' %
(len(words_label), line_count))
return words_label
def node_classify_model(config):
"""Build node classify model.
"""
nodes = fl.data('nodes', shape=[None, 1], dtype='int64')
labels = fl.data('labels', shape=[None, 1], dtype='int64')
embed_nodes = fl.embedding(
input=nodes,
size=[config.num_nodes, config.embed_dim],
param_attr=fluid.ParamAttr(name='weight'))
embed_nodes.stop_gradient = True
probs = fl.fc(input=embed_nodes, size=config.num_labels, act='softmax')
predict = fl.argmax(probs, axis=-1)
loss = fl.cross_entropy(input=probs, label=labels)
loss = fl.reduce_mean(loss)
return {
'loss': loss,
'probs': probs,
'predict': predict,
'labels': labels,
}
def run_epoch(exe, prog, model, feed_dict, lr):
"""Run training process of every epoch.
"""
if lr is None:
loss, predict = exe.run(prog,
feed=feed_dict,
fetch_list=[model['loss'], model['predict']],
return_numpy=True)
lr_ = 0
else:
loss, predict, lr_ = exe.run(
prog,
feed=feed_dict,
fetch_list=[model['loss'], model['predict'], lr],
return_numpy=True)
macro_f1 = f1_score(feed_dict['labels'], predict, average="macro")
micro_f1 = f1_score(feed_dict['labels'], predict, average="micro")
return {
'loss': loss,
'pred': predict,
'lr': lr_,
'macro_f1': macro_f1,
'micro_f1': micro_f1
}
def main(args):
"""main function for training node classification task.
"""
words_label = load_data(args.dataset)
# split data for training and testing
split_position = int(words_label.shape[0] * args.train_percent)
train_words_label = words_label[0:split_position, :]
test_words_label = words_label[split_position:, :]
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
model = node_classify_model(args)
test_prog = train_prog.clone(for_test=True)
with fluid.program_guard(train_prog, startup_prog):
lr = fl.polynomial_decay(args.lr, 1000, 0.001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(model['loss'])
exe = fluid.Executor(place)
exe.run(startup_prog)
def existed_params(var):
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(args.ckpt_path, var.name))
fluid.io.load_vars(
exe, args.ckpt_path, main_program=train_prog, predicate=existed_params)
# load_param(args.ckpt_path, ['content'])
feed_dict = {}
X = train_words_label[:, 0].reshape(-1, 1)
labels = train_words_label[:, 1].reshape(-1, 1)
logging.info('%d/%d data to train' %
(labels.shape[0], words_label.shape[0]))
test_feed_dict = {}
test_X = test_words_label[:, 0].reshape(-1, 1)
test_labels = test_words_label[:, 1].reshape(-1, 1)
logging.info('%d/%d data to test' %
(test_labels.shape[0], words_label.shape[0]))
for epoch in range(args.epochs):
feed_dict['nodes'] = X
feed_dict['labels'] = labels
train_result = run_epoch(exe, train_prog, model, feed_dict, lr)
test_feed_dict['nodes'] = test_X
test_feed_dict['labels'] = test_labels
test_result = run_epoch(exe, test_prog, model, test_feed_dict, lr=None)
logging.info(
'epoch %d | lr %.4f | train_loss %.5f | train_macro_F1 %.4f | train_micro_F1 %.4f | test_loss %.5f | test_macro_F1 %.4f | test_micro_F1 %.4f'
% (epoch, train_result['lr'], train_result['loss'],
train_result['macro_f1'], train_result['micro_f1'],
test_result['loss'], test_result['macro_f1'],
test_result['micro_f1']))
logging.info(
'final_test_macro_f1 score: %.4f | final_test_micro_f1 score: %.4f' %
(test_result['macro_f1'], test_result['micro_f1']))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='multi_class')
parser.add_argument(
'--dataset',
default=None,
type=str,
help='training and testing data file(default: None)')
parser.add_argument(
'--ckpt_path', default=None, type=str, help='task name(default: None)')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
'--train_percent',
default=0.5,
type=float,
help='train_percent(default: 0.5)')
parser.add_argument(
'--num_labels',
default=4,
type=int,
help='number of labels(default: 4)')
parser.add_argument(
'--epochs',
default=100,
type=int,
help='number of epochs for training(default: 100)')
parser.add_argument(
'--lr',
default=0.025,
type=float,
help='learning rate(default: 0.025)')
parser.add_argument(
'--num_nodes', default=0, type=int, help='number of nodes')
parser.add_argument(
'--embed_dim',
default=64,
type=int,
help='dimension of embedding(default: 64)')
args = parser.parse_args()
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level='INFO', format=log_format)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Implementation of some helper functions"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import os
import time
import yaml
import numpy as np
from pgl.utils.logger import log
class AttrDict(dict):
"""Attr dict
"""
def __init__(self, d):
self.dict = d
def __getattr__(self, attr):
value = self.dict[attr]
if isinstance(value, dict):
return AttrDict(value)
else:
return value
def __str__(self):
return str(self.dict)
def load_config(config_file):
"""Load config file"""
with open(config_file) as f:
if hasattr(yaml, 'FullLoader'):
config = yaml.load(f, Loader=yaml.FullLoader)
else:
config = yaml.load(f)
return AttrDict(config)
# parse yaml file
function parse_yaml {
local prefix=$2
local s='[[:space:]]*' w='[a-zA-Z0-9_]*' fs=$(echo @|tr @ '\034')
sed -ne "s|^\($s\):|\1|" \
-e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
-e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $1 |
awk -F$fs '{
indent = length($1)/2;
vname[indent] = $2;
for (i in vname) {if (i > indent) {delete vname[i]}}
if (length($3) > 0) {
vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
printf("%s%s%s=\"%s\"\n", "'$prefix'",vn, $2, $3);
}
}'
}
eval $(parse_yaml "$(dirname "${BASH_SOURCE}")"/config.yaml)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""doc
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import random
from pgl.utils.logger import log
from pgl.sample import metapath_randomwalk
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
from utils import load_config
from graph import m2vGraph
import mp_reader
class NodeGenerator(object):
"""Node generator"""
def __init__(self, config, graph):
self.config = config
self.graph = graph
self.batch_size = self.config.batch_size
self.shuffle = self.config.node_shuffle
self.node_files = self.config.node_files
self.first_node_type = self.config.first_node_type
self.walk_mode = self.config.walk_mode
def __call__(self):
if self.walk_mode == "m2v":
generator = self.m2v_node_generate
log.info("node gen mode is : %s" % (self.walk_mode))
elif self.walk_mode == "multi_m2v":
generator = self.multi_m2v_node_generate
log.info("node gen mode is : %s" % (self.walk_mode))
elif self.walk_mode == "files":
generator = self.files_node_generate
log.info("node gen mode is : %s" % (self.walk_mode))
else:
generator = self.m2v_node_generate
log.info("node gen mode is : %s" % (self.walk_mode))
while True:
for nodes in generator():
yield nodes
def m2v_node_generate(self):
"""m2v_node_generate"""
for nodes in self.graph.node_batch_iter(
batch_size=self.batch_size,
n_type=self.first_node_type,
shuffle=self.shuffle):
yield nodes
def multi_m2v_node_generate(self):
"""multi_m2v_node_generate"""
n_type_list = self.first_node_type.split(';')
num_n_type = len(n_type_list)
node_types = np.unique(self.graph.node_types).tolist()
node_generators = {}
for n_type in node_types:
node_generators[n_type] = \
self.graph.node_batch_iter(self.batch_size, n_type=n_type)
cc = 0
while True:
idx = cc % num_n_type
n_type = n_type_list[idx]
try:
nodes = next(node_generators[n_type])
except StopIteration as e:
log.info("node type of %s iteration finished in one epoch" %
(n_type))
node_generators[n_type] = \
self.graph.node_batch_iter(self.batch_size, n_type=n_type)
break
yield (nodes, idx)
cc += 1
def files_node_generate(self):
"""files_node_generate"""
nodes = []
for filename in self.node_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
class WalkGenerator(object):
"""Walk generator"""
def __init__(self, config, dataset):
self.config = config
self.dataset = dataset
self.graph = self.dataset.graph
self.walk_mode = self.config.walk_mode
self.node_generator = NodeGenerator(self.config, self.graph)
if self.walk_mode == "multi_m2v":
num_path = len(self.config.meta_path.split(';'))
num_first_node_type = len(self.config.first_node_type.split(';'))
assert num_first_node_type == num_path, \
"In [multi_m2v] walk_mode, the number of metapath should be the same \
as the number of first_node_type"
assert num_path > 1, "In [multi_m2v] walk_mode, the number of metapath\
should be greater than 1"
def __call__(self):
np.random.seed(os.getpid())
if self.walk_mode == "m2v":
walk_generator = self.m2v_walk
log.info("walk mode is : %s" % (self.walk_mode))
elif self.walk_mode == "multi_m2v":
walk_generator = self.multi_m2v_walk
log.info("walk mode is : %s" % (self.walk_mode))
else:
raise ValueError("walk_mode [%s] is not matched" % self.walk_mode)
for walks in walk_generator():
yield walks
def m2v_walk(self):
"""Metapath2vec walker"""
for nodes in self.node_generator():
walks = metapath_randomwalk(
self.graph, nodes, self.config.meta_path, self.config.walk_len)
yield walks
def multi_m2v_walk(self):
"""Multi metapath2vec walker"""
meta_paths = self.config.meta_path.split(';')
for nodes, idx in self.node_generator():
walks = metapath_randomwalk(self.graph, nodes, meta_paths[idx],
self.config.walk_len)
yield walks
class DataGenerator(object):
def __init__(self, config, dataset):
self.config = config
self.dataset = dataset
self.graph = self.dataset.graph
self.walk_generator = WalkGenerator(self.config, self.dataset)
def __call__(self):
generator = self.pair_generate
for src, pos, negs in generator():
dst = np.concatenate([pos, negs], 1)
yield src, dst
def pair_generate(self):
for walks in self.walk_generator():
try:
src_list, pos_list = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.config.win_size)
src_list.append(s), pos_list.append(p)
src = [s for x in src_list for s in x]
pos = [s for x in pos_list for s in x]
if len(src) == 0:
continue
negs = self.negative_sample(
src,
pos,
neg_num=self.config.neg_num,
neg_sample_type=self.config.neg_sample_type)
src = np.array(src, dtype=np.int64).reshape(-1, 1, 1)
pos = np.array(pos, dtype=np.int64).reshape(-1, 1, 1)
yield src, pos, negs
except Exception as e:
log.exception(e)
def negative_sample(self, src, pos, neg_num, neg_sample_type):
if neg_sample_type == "average":
neg_sample_size = [len(pos), neg_num, 1]
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
elif neg_sample_type == "m2v_plus":
negs = []
for s in src:
neg = self.graph.sample_nodes(
sample_num=neg_num, n_type=self.graph.node_types[s])
negs.append(neg)
negs = np.vstack(negs).reshape(-1, neg_num, 1)
else: # equal to "average"
neg_sample_size = [len(pos), neg_num, 1]
negs = np.random.randint(
low=0, high=self.graph.num_nodes, size=neg_sample_size)
negs = negs.astype(np.int64)
return negs
def multiprocess_data_generator(config, dataset):
"""Multiprocess data generator.
"""
if config.num_sample_workers == 1:
data_generator = DataGenerator(config, dataset)
else:
pool = [
DataGenerator(config, dataset)
for i in range(config.num_sample_workers)
]
data_generator = mp_reader.multiprocess_reader(
pool, use_pipe=True, queue_size=100)
return data_generator
if __name__ == "__main__":
config_file = "./config.yaml"
config = load_config(config_file)
dataset = m2vGraph(config)
data_generator = multiprocess_data_generator(config, dataset)
start = time.time()
cc = 0
for src, dst in data_generator():
log.info(src.shape)
log.info("time: %.6f" % (time.time() - start))
start = time.time()
cc += 1
if cc == 100:
break
# PGL Examples for GAT
# GAT: Graph Attention Networks
[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architecture that operates on graph-structured data, leveraging masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce the GAT algorithm and match the accuracy reported in the paper on citation network benchmarks.
### Simple example to build single head GAT
......@@ -26,24 +26,25 @@ def gat_layer(graph_wrapper, node_feature, hidden_size):
return output
```
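Since the diff above elides most of `gat_layer`, the following is a minimal, hedged sketch of a single-head GAT layer built on PGL's high-level wrapper rather than the hand-written send/recv version; the `pgl.layers.gat` call and its argument names are assumed from the PGL 1.x API and may differ slightly from the code in this example.
```python
import pgl

def single_head_gat(graph_wrapper, node_feature, hidden_size, name):
    # One attention head, no dropout: pgl.layers.gat computes attention
    # coefficients over incoming edges and aggregates neighbor features
    # weighted by those coefficients (API assumed from PGL 1.x).
    output = pgl.layers.gat(graph_wrapper,
                            node_feature,
                            hidden_size,
                            activation="elu",
                            name=name,
                            num_heads=1,
                            feat_drop=0.0,
                            attn_drop=0.0)
    return output
```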
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~83% | 0.0188s | 0.0175s |
| Pubmed | ~78% | 0.0449s | 0.0295s |
| Citeseer | ~70% | 0.0275 | 0.0253s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~83% |
| Pubmed | ~78% |
| Citeseer | ~70% |
### How to run
......
......@@ -68,7 +68,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -111,7 +111,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -121,7 +121,7 @@ def main(args):
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -132,7 +132,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
# PGL Examples for GCN
# GCN: Graph Convolutional Networks
[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce the GCN algorithm and match the accuracy reported in the paper on citation network benchmarks.
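As a reference for what the model computes, here is a framework-free NumPy sketch of the GCN propagation rule (dense adjacency, hypothetical helper name); the PGL example implements the same rule with sparse message passing rather than this dense formulation.
```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN layer: H' = relu(D^-1/2 (A + I) D^-1/2 H W)."""
    a_hat = adj + np.eye(adj.shape[0])              # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))   # D^-1/2 of the new degrees
    a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0)
```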
......@@ -26,18 +26,18 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | --- |---|
| Cora | ~81% | 0.0106s | 0.0104s |
| Pubmed | ~79% | 0.0210s | 0.0154s |
| Citeseer | ~71% | 0.0175s | 0.0177s |
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~79% |
| Citeseer | ~71% |
### How to run
......
......@@ -70,7 +70,7 @@ def main(args):
node_index = fluid.layers.data(
"node_index",
shape=[None, 1],
dtype="int32",
dtype="int64",
append_batch_size=False)
node_label = fluid.layers.data(
"node_label",
......@@ -113,7 +113,7 @@ def main(args):
for epoch in range(200):
if epoch >= 3:
t0 = time.time()
feed_dict["node_index"] = np.array(train_index, dtype="int32")
feed_dict["node_index"] = np.array(train_index, dtype="int64")
feed_dict["node_label"] = np.array(train_label, dtype="int64")
train_loss, train_acc = exe.run(train_program,
feed=feed_dict,
......@@ -123,7 +123,7 @@ def main(args):
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
feed_dict["node_index"] = np.array(val_index, dtype="int32")
feed_dict["node_index"] = np.array(val_index, dtype="int64")
feed_dict["node_label"] = np.array(val_label, dtype="int64")
val_loss, val_acc = exe.run(test_program,
feed=feed_dict,
......@@ -134,7 +134,7 @@ def main(args):
"Train Loss: %f " % train_loss + "Train Acc: %f " % train_acc
+ "Val Loss: %f " % val_loss + "Val Acc: %f " % val_acc)
feed_dict["node_index"] = np.array(test_index, dtype="int32")
feed_dict["node_index"] = np.array(test_index, dtype="int64")
feed_dict["node_label"] = np.array(test_label, dtype="int64")
test_loss, test_acc = exe.run(test_program,
feed=feed_dict,
......
# GES: Graph Embedding with Side Information
[Graph Embedding with Side Information](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representation learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the GES algorithm.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## How to run
For example, train GES on the BlogCatalog dataset.
```sh
# train GES with GPU
sh gpu_run.sh
```
## Hyperparameters
- dataset: The dataset to use, e.g. "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epochs.
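For intuition, the core idea of GES is to fuse a node's id embedding with the embeddings of its side information before applying the skip-gram objective. The sketch below (hypothetical names, NumPy only) illustrates the equal-weight average that the model in this example computes with `reduce_mean`; it is not the training code itself.
```python
import numpy as np

def ges_node_vector(node_id, side_info_ids, embedding_table):
    """Average the node embedding with its side-information embeddings."""
    ids = [node_id] + list(side_info_ids)
    return embedding_table[ids].mean(axis=0)   # one fused vector per node
```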
#!/bin/bash
export FLAGS_sync_nccl_allreduce=1
export FLAGS_eager_delete_tensor_gb=0
export FLAGS_fraction_of_gpu_memory_to_use=1
export NCCL_DEBUG=INFO
export NCCL_IB_GID_INDEX=3
export GLOG_v=1
export GLOG_logtostderr=1
num_nodes=10312
num_embedding=10351
num_sample_workers=20
# build train_data
rm -rf train_data && mkdir -p train_data
cd train_data
seq 0 $((num_nodes-1)) | shuf | split -l $((num_nodes/num_sample_workers+1))
cd -
python3 gpu_train.py --output_path ./output --epoch 100 --walk_len 40 --win_size 5 --neg_num 5 --batch_size 128 --hidden_size 128 \
--num_nodes $num_nodes --num_embedding $num_embedding --num_sample_workers $num_sample_workers --steps_per_save 2000 --dataset "BlogCatalog"
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" gpu_train
"""
import argparse
import time
import os
import glob
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.utils.logger import log
from pgl.graph import Graph
from pgl.sample import graph_alias_sample_table
from pgl import data_loader
import mp_reader
from reader import GESReader
from model import GESModel
def get_file_list(path):
"""get_file_list
"""
filelist = []
if os.path.isfile(path):
filelist = [path]
elif os.path.isdir(path):
filelist = [
os.path.join(dp, f)
for dp, dn, filenames in os.walk(path) for f in filenames
]
else:
raise ValueError(path + " not supported")
return filelist
def build_graph(num_nodes, edge_path, output_path, undigraph=True):
""" build_graph
"""
edge_file = os.path.join(output_path, "edge.npy")
edge_weight_file = os.path.join(output_path, "edge_weight.npy")
alias_file = os.path.join(output_path, "alias.npy")
events_file = os.path.join(output_path, "events.npy")
if os.path.isfile(edge_file):
edges = np.load(edge_file)
edge_feat = dict()
if os.path.isfile(edge_weight_file):
log.info("Loading weight from cache")
edge_feat["weight"] = np.load(edge_weight_file, allow_pickle=True)
node_feat = dict()
if os.path.isfile(alias_file):
log.info("Loading alias from cache")
node_feat["alias"] = np.load(alias_file, allow_pickle=True)
if os.path.isfile(events_file):
log.info("Loading events from cache")
node_feat["events"] = np.load(events_file, allow_pickle=True)
else:
filelist = get_file_list(edge_path)
edges, edge_weight = [], []
log.info("Reading edge files")
for name in filelist:
with open(name) as inf:
for line in inf:
slots = line.strip("\n").split()
edges.append([slots[0], slots[1]])
if len(slots) > 2:
edge_weight.append(slots[2])
edges = np.array(edges, dtype="int64")
assert num_nodes > edges.max(
), "Node id in any edge should be smaller than num_nodes!"
log.info("Read edge files done.")
edge_feat = dict()
node_feat = dict()
if len(edge_weight) == len(edges):
edge_feat["weight"] = np.array(edge_weight, dtype="float32")
if undigraph is True:
edges = np.concatenate([edges, edges[:, [1, 0]]], 0)
if "weight" in edge_feat:
edge_feat["weight"] = np.concatenate(
[edge_feat["weight"], edge_feat["weight"]],
0).astype("float64")
graph = Graph(num_nodes, edges, node_feat, edge_feat=edge_feat)
log.info("Build graph done")
graph.outdegree()
log.info("Build graph index done")
if "weight" in graph.edge_feat and "alias" not in graph.node_feat and "events" not in graph.node_feat:
graph.node_feat["alias"], graph.node_feat[
"events"] = graph_alias_sample_table(graph, "weight")
log.info(
"Build graph alias sample table done, and saving alias & evnets cache"
)
np.save(alias_file, graph.node_feat["alias"])
np.save(events_file, graph.node_feat["events"])
return graph
def optimization(base_lr, loss, train_steps, optimizer='adam'):
""" optimization
"""
decayed_lr = L.polynomial_decay(base_lr, train_steps, 0.0001)
if optimizer == 'sgd':
optimizer = F.optimizer.SGD(
decayed_lr,
regularization=F.regularizer.L2DecayRegularizer(
regularization_coeff=0.0025))
elif optimizer == 'adam':
# dont use gpu's lazy mode
optimizer = F.optimizer.Adam(decayed_lr)
else:
raise ValueError
log.info('learning rate:%f' % (base_lr))
optimizer.minimize(loss)
def build_gen_func(args, graph, node_feat):
""" build_gen_func
"""
num_sample_workers = args.num_sample_workers
if args.walkpath_files is None:
walkpath_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.walkpath_files)
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
walkpath_files[idx % num_sample_workers].append(f)
if args.train_files is None:
train_files = [None for _ in range(num_sample_workers)]
else:
files = get_file_list(args.train_files)
train_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(files):
train_files[idx % num_sample_workers].append(f)
gen_func_pool = [
GESReader(
graph,
node_feat,
batch_size=args.batch_size,
walk_len=args.walk_len,
win_size=args.win_size,
neg_num=args.neg_num,
neg_sample_type=args.neg_sample_type,
walkpath_files=walkpath_files[i],
train_files=train_files[i]) for i in range(num_sample_workers)
]
if num_sample_workers == 1:
gen_func = gen_func_pool[0]
else:
gen_func = mp_reader.multiprocess_reader(
gen_func_pool, use_pipe=True, queue_size=100)
return gen_func
def get_parallel_exe(program, loss):
""" get_parallel_exe
"""
exec_strategy = F.ExecutionStrategy()
exec_strategy.num_threads = 1 #2 for fp32 4 for fp16
exec_strategy.use_experimental_executor = True
exec_strategy.num_iteration_per_drop_scope = 10  # clean up local scopes periodically to save memory
build_strategy = F.BuildStrategy()
build_strategy.enable_inplace = True
build_strategy.memory_optimize = True
build_strategy.remove_unnecessary_lock = True
#return compiled_prog
train_exe = F.ParallelExecutor(
use_cuda=True,
loss_name=loss.name,
build_strategy=build_strategy,
exec_strategy=exec_strategy,
main_program=program)
return train_exe
def train(train_exe, exe, program, loss, node2vec_pyreader, args, train_steps):
""" train
"""
trainer_id = int(os.getenv("PADDLE_TRAINER_ID", "0"))
step = 0
while True:
try:
begin_time = time.time()
loss_val, = train_exe.run(fetch_list=[loss])
log.info("step %s: loss %.5f speed: %.5f s/step" %
(step, np.mean(loss_val), time.time() - begin_time))
step += 1
except F.core.EOFException:
node2vec_pyreader.reset()
if (step % args.steps_per_save == 0 or
step == train_steps) and trainer_id == 0:
model_save_dir = args.output_path
model_path = os.path.join(model_save_dir, str(step))
if not os.path.exists(model_save_dir):
os.makedirs(model_save_dir)
F.io.save_params(exe, model_path, program)
if step == train_steps:
break
def test_gen_speed(gen_func):
""" test_gen_speed
"""
cur_time = time.time()
for idx, _ in enumerate(gen_func()):
log.info("iter %s: %s s" % (idx, time.time() - cur_time))
cur_time = time.time()
if idx == 100:
break
def main(args):
""" main
"""
import logging
log.setLevel(logging.DEBUG)
log.info("start")
if args.dataset is not None:
if args.dataset == "BlogCatalog":
graph = data_loader.BlogCatalogDataset().graph
else:
raise ValueError(args.dataset + " dataset doesn't exist")
log.info("Load buildin BlogCatalog dataset done.")
node_feat = np.expand_dims(graph.node_feat["group_id"].argmax(-1),
-1) + graph.num_nodes
args.num_nodes = graph.num_nodes
args.num_embedding = graph.num_nodes + graph.node_feat[
"group_id"].shape[-1]
else:
graph = build_graph(args.num_nodes, args.edge_path, args.output_path)
node_feat = np.load(args.node_feat_npy)
model = GESModel(args.num_embedding, node_feat.shape[1] + 1,
args.hidden_size, args.neg_num, False, 2)
pyreader = model.pyreader
loss = model.forward()
num_devices = len(F.cuda_places())
train_steps = int(args.num_nodes * args.epoch / args.batch_size /
num_devices)
log.info("Train steps: %s" % train_steps)
optimization(args.lr * num_devices, loss, train_steps, args.optimizer)
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(F.default_startup_program())
gen_func = build_gen_func(args, graph, node_feat)
pyreader.decorate_tensor_provider(gen_func)
pyreader.start()
train_prog = F.default_main_program()
train_exe = get_parallel_exe(train_prog, loss)
train(train_exe, exe, train_prog, loss, pyreader, args, train_steps)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Deepwalk')
parser.add_argument("--hidden_size", type=int, default=64)
parser.add_argument("--lr", type=float, default=0.025)
parser.add_argument("--neg_num", type=int, default=5)
parser.add_argument("--epoch", type=int, default=100)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--walk_len", type=int, default=40)
parser.add_argument("--win_size", type=int, default=5)
parser.add_argument("--output_path", type=str, default="output")
parser.add_argument("--num_sample_workers", type=int, default=1)
parser.add_argument("--steps_per_save", type=int, default=3000)
parser.add_argument("--num_nodes", type=int, default=10000)
parser.add_argument("--num_embedding", type=int, default=10000)
parser.add_argument("--edge_path", type=str, default="./graph_data")
parser.add_argument("--walkpath_files", type=str, default=None)
parser.add_argument("--train_files", type=str, default="./train_data")
parser.add_argument("--node_feat_npy", type=str, default="./feat.npy")
parser.add_argument("--dataset", type=str, default=None)
parser.add_argument(
"--neg_sample_type",
type=str,
default="average",
choices=["average", "outdegree"])
parser.add_argument(
"--optimizer",
type=str,
required=False,
choices=['adam', 'sgd'],
default="adam")
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
GES model file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import math
import paddle.fluid.layers as L
import paddle.fluid as F
def split_embedding(input,
dict_size,
hidden_size,
initializer,
name,
num_part=16,
is_sparse=False,
learning_rate=1.0):
""" split_embedding
"""
_part_size = hidden_size // num_part
if hidden_size % num_part != 0:
_part_size += 1
output_embedding = []
p_num = 0
while hidden_size > 0:
_part_size = min(_part_size, hidden_size)
hidden_size -= _part_size
print("part", p_num, "size=", (dict_size, _part_size))
part_embedding = L.embedding(
input=input,
size=(dict_size, _part_size),
is_sparse=is_sparse,
is_distributed=False,
param_attr=F.ParamAttr(
name=name + '_part%s' % p_num,
initializer=initializer,
learning_rate=learning_rate))
p_num += 1
output_embedding.append(part_embedding)
return L.concat(output_embedding, -1)
class GESModel(object):
""" GESModel
"""
def __init__(self,
num_nodes,
num_featuers,
hidden_size=16,
neg_num=5,
is_sparse=False,
num_part=1):
self.pyreader = L.py_reader(
capacity=70,
shapes=[[-1, 1, num_featuers, 1],
[-1, neg_num + 1, num_featuers, 1]],
dtypes=['int64', 'int64'],
lod_levels=[0, 0],
name='train',
use_double_buffer=True)
self.num_nodes = num_nodes
self.num_featuers = num_featuers
self.neg_num = neg_num
self.embed_init = F.initializer.TruncatedNormal(scale=1.0 /
math.sqrt(hidden_size))
self.is_sparse = is_sparse
self.num_part = num_part
self.hidden_size = hidden_size
self.loss = None
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
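# Average over the feature axis: fuse the node id embedding with its side-information embeddings into one vector (the GES aggregation).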
src_embed = L.reduce_mean(src_embed, 2)
dst_embed = L.reduce_mean(dst_embed, 2)
logits = L.matmul(
src_embed, dst_embed,
transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
class EGESModel(GESModel):
""" EGESModel
"""
def forward(self):
""" forward
"""
src, dst = L.read_file(self.pyreader)
src_id = L.slice(src, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, 1, 1, 1])
dst_id = L.slice(dst, [0, 1, 2, 3], [0, 0, 0, 0],
[int(math.pow(2, 30)) - 1, self.neg_num + 1, 1, 1])
if self.is_sparse:
# sparse mode use 2 dims input.
src = L.reshape(src, [-1, 1])
dst = L.reshape(dst, [-1, 1])
# [b, 1, f, h]
src_embed = split_embedding(src, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
# [b, n+1, f, h]
dst_embed = split_embedding(dst, self.num_nodes, self.hidden_size,
self.embed_init, "weight", self.num_part,
self.is_sparse)
if self.is_sparse:
src_embed = L.reshape(
src_embed, [-1, 1, self.num_featuers, self.hidden_size])
dst_embed = L.reshape(
dst_embed,
[-1, self.neg_num + 1, self.num_featuers, self.hidden_size])
# [b, 1, 1, f]
src_weight = L.softmax(
L.embedding(
src_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, n+1, 1, f]
dst_weight = L.softmax(
L.embedding(
dst_id, [self.num_nodes, self.num_featuers],
param_attr=F.ParamAttr(name="alpha")))
# [b, 1, h]
src_sum = L.squeeze(L.matmul(src_weight, src_embed), axes=[2])
# [b, n+1, h]
dst_sum = L.squeeze(L.matmul(dst_weight, dst_embed), axes=[2])
logits = L.matmul(
src_sum, dst_sum, transpose_y=True) # [batch_size, 1, neg_num+1]
pos_label = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", 1)
neg_label = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 0)
label = L.concat([pos_label, neg_label], -1)
pos_weight = L.fill_constant_batch_size_like(logits, [-1, 1, 1],
"float32", self.neg_num)
neg_weight = L.fill_constant_batch_size_like(
logits, [-1, 1, self.neg_num], "float32", 1)
weight = L.concat([pos_weight, neg_weight], -1)
weight.stop_gradient = True
label.stop_gradient = True
loss = L.sigmoid_cross_entropy_with_logits(logits, label)
loss = loss * weight
loss = L.reduce_mean(loss)
loss = loss * ((self.neg_num + 1) / 2 / self.neg_num)
loss.persistable = True
self.loss = loss
return loss
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import multiprocessing
import numpy as np
import time
import paddle.fluid as fluid
import pyarrow
def _serialize_serializable(obj):
"""Serialize Feed Dict
"""
return {"type": type(obj), "data": obj.__dict__}
def _deserialize_serializable(obj):
"""Deserialize Feed Dict
"""
val = obj["type"].__new__(obj["type"])
val.__dict__.update(obj["data"])
return val
context = pyarrow.default_serialization_context()
context.register_type(
object,
"object",
custom_serializer=_serialize_serializable,
custom_deserializer=_deserialize_serializable)
def serialize_data(data):
"""serialize_data"""
return pyarrow.serialize(data, context=context).to_buffer().to_pybytes()
def deserialize_data(data):
"""deserialize_data"""
return pyarrow.deserialize(data, context=context)
def multiprocess_reader(readers, use_pipe=True, queue_size=1000):
"""
multiprocess_reader uses multiple Python processes to read data from the
given readers and merges the samples through a multiprocessing.Queue or
multiprocessing.Pipe. One process is started per reader, so the readers
should be independent of each other.
Note that multiprocessing.Queue requires read/write access to /dev/shm,
which some platforms do not provide.
An example:
.. code-block:: python
    reader0 = reader(["file01", "file02"])
    reader1 = reader(["file11", "file12"])
    reader2 = reader(["file21", "file22"])
    reader = multiprocess_reader([reader0, reader1, reader2],
        queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn))
p.start()
reader_num = len(readers)
finish_num = 0
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < reader_num:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
buff = conn.recv()
now = time.time()
sample = deserialize_data(buff)
out = time.time() - now
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
if use_pipe:
return pipe_reader
else:
return queue_reader
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Reader file.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
import time
import io
import os
import numpy as np
import paddle
from pgl.utils.logger import log
from pgl.sample import node2vec_sample
from pgl.sample import deepwalk_sample
from pgl.sample import alias_sample
from pgl.graph_kernel import skip_gram_gen_pair
from pgl.graph_kernel import alias_sample_build_table
class GESReader(object):
""" GESReader
"""
def __init__(self,
graph,
node_feat,
batch_size=512,
walk_len=40,
win_size=5,
neg_num=5,
train_files=None,
walkpath_files=None,
neg_sample_type="average"):
"""
Args:
walkpath_files: if is not None, read walk path from walkpath_files
"""
self.graph = graph
self.node_feat = node_feat
self.batch_size = batch_size
self.walk_len = walk_len
self.win_size = win_size
self.neg_num = neg_num
self.train_files = train_files
self.walkpath_files = walkpath_files
self.neg_sample_type = neg_sample_type
def walk_from_files(self):
""" walk_from_files
"""
bucket = []
while True:
for filename in self.walkpath_files:
with io.open(filename) as inf:
for line in inf:
walk = [int(x) for x in line.strip('\n\t').split('\t')]
bucket.append(walk)
if len(bucket) == self.batch_size:
yield bucket
bucket = []
if len(bucket):
yield bucket
def walk_from_graph(self):
""" walk_from_graph
"""
def node_generator():
""" node_generator
"""
if self.train_files is None:
while True:
for nodes in self.graph.node_batch_iter(self.batch_size):
yield nodes
else:
nodes = []
while True:
for filename in self.train_files:
with io.open(filename) as inf:
for line in inf:
node = int(line.strip('\n\t'))
nodes.append(node)
if len(nodes) == self.batch_size:
yield nodes
nodes = []
if len(nodes):
yield nodes
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
log.info("Deepwalk using alias sample")
for nodes in node_generator():
if "alias" in self.graph.node_feat and "events" in self.graph.node_feat:
walks = deepwalk_sample(self.graph, nodes, self.walk_len,
"alias", "events")
else:
walks = deepwalk_sample(self.graph, nodes, self.walk_len)
yield walks
def walk_generator(self):
""" walk_generator
"""
if self.walkpath_files is not None:
for i in self.walk_from_files():
yield i
else:
for i in self.walk_from_graph():
yield i
def __call__(self):
np.random.seed(os.getpid())
if self.neg_sample_type == "outdegree":
outdegree = self.graph.outdegree()
distribution = 1. * outdegree / outdegree.sum()
alias, events = alias_sample_build_table(distribution)
max_len = int(self.batch_size * self.walk_len * (
(1 + self.win_size) - 0.3))
for walks in self.walk_generator():
src, pos = [], []
for walk in walks:
s, p = skip_gram_gen_pair(walk, self.win_size)
src.extend(s), pos.extend(p)
src = np.array(src, dtype=np.int64)
pos = np.array(pos, dtype=np.int64)
src, pos = np.reshape(src, [-1, 1, 1]), np.reshape(pos, [-1, 1, 1])
if src.shape[0] == 0:
continue
neg_sample_size = [len(pos), self.neg_num, 1]
if self.neg_sample_type == "average":
negs = self.graph.sample_nodes(neg_sample_size)
elif self.neg_sample_type == "outdegree":
negs = alias_sample(neg_sample_size, alias, events)
# [batch_size, 1, 1] [batch_size, neg_num+1, 1]
dst = np.concatenate([pos, negs], 1)
src_feat = np.concatenate([src, self.node_feat[src[:, :, 0]]], -1)
dst_feat = np.concatenate([dst, self.node_feat[dst[:, :, 0]]], -1)
src_feat, dst_feat = np.expand_dims(src_feat, -1), np.expand_dims(
dst_feat, -1)
yield src_feat[:max_len], dst_feat[:max_len]
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implement the dataset for GIN model.
"""
import os
import sys
import numpy as np
from sklearn.model_selection import StratifiedKFold
import pgl
from pgl.utils.logger import log
def fold10_split(dataset, fold_idx=0, seed=0, shuffle=True):
"""10 fold splitter"""
assert 0 <= fold_idx and fold_idx < 10, print(
"fold_idx must be from 0 to 9.")
skf = StratifiedKFold(n_splits=10, shuffle=shuffle, random_state=seed)
labels = []
for i in range(len(dataset)):
g, c = dataset[i]
labels.append(c)
idx_list = []
for idx in skf.split(np.zeros(len(labels)), labels):
idx_list.append(idx)
train_idx, valid_idx = idx_list[fold_idx]
log.info("train_set : test_set == %d : %d" %
(len(train_idx), len(valid_idx)))
return Subset(dataset, train_idx), Subset(dataset, valid_idx)
def random_split(dataset, split_ratio=0.7, seed=0, shuffle=True):
"""random splitter"""
np.random.seed(seed)
indices = list(range(len(dataset)))
np.random.shuffle(indices)
split = int(split_ratio * len(dataset))
train_idx, valid_idx = indices[:split], indices[split:]
log.info("train_set : test_set == %d : %d" %
(len(train_idx), len(valid_idx)))
return Subset(dataset, train_idx), Subset(dataset, valid_idx)
class BaseDataset(object):
"""BaseDataset"""
def __init__(self):
pass
def __getitem__(self, idx):
"""getitem"""
raise NotImplementedError
def __len__(self):
"""len"""
raise NotImplementedError
class Subset(BaseDataset):
"""
Subset of a dataset at specified indices.
"""
def __init__(self, dataset, indices):
self.dataset = dataset
self.indices = indices
def __getitem__(self, idx):
"""getitem"""
return self.dataset[self.indices[idx]]
def __len__(self):
"""len"""
return len(self.indices)
class GINDataset(BaseDataset):
"""Dataset for Graph Isomorphism Network (GIN)
Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.
"""
def __init__(self,
data_path,
dataset_name,
self_loop,
degree_as_nlabel=False):
self.data_path = data_path
self.dataset_name = dataset_name
self.self_loop = self_loop
self.degree_as_nlabel = degree_as_nlabel
self.graph_list = []
self.glabel_list = []
# relabel
self.glabel_dict = {}
self.nlabel_dict = {}
self.elabel_dict = {}
self.ndegree_dict = {}
# global num
self.num_graph = 0 # total graphs number
self.n = 0 # total nodes number
self.m = 0 # total edges number
# global num of classes
self.gclasses = 0
self.nclasses = 0
self.eclasses = 0
self.dim_nfeats = 0
# flags
self.degree_as_nlabel = degree_as_nlabel
self.nattrs_flag = False
self.nlabels_flag = False
self._load_data()
def __len__(self):
"""return the number of graphs"""
return len(self.graph_list)
def __getitem__(self, idx):
"""getitem"""
return self.graph_list[idx], self.glabel_list[idx]
def _load_data(self):
"""Loads dataset
"""
filename = os.path.join(self.data_path, self.dataset_name,
"%s.txt" % self.dataset_name)
log.info("loading data from %s" % filename)
with open(filename, 'r') as reader:
# first line --> N, means total number of graphs
self.num_graph = int(reader.readline().strip())
for i in range(self.num_graph):
if (i + 1) % int(self.num_graph / 10) == 0:
log.info("processing graph %s" % (i + 1))
graph = dict()
# second line --> [num_node, label]
# means [node number of a graph, class label of a graph]
grow = reader.readline().strip().split()
n_nodes, glabel = [int(w) for w in grow]
# relabel graphs
if glabel not in self.glabel_dict:
mapped = len(self.glabel_dict)
self.glabel_dict[glabel] = mapped
graph['num_nodes'] = n_nodes
self.glabel_list.append(self.glabel_dict[glabel])
nlabels = []
node_features = []
num_edges = 0
edges = []
for j in range(graph['num_nodes']):
slots = reader.readline().strip().split()
# handle edges and node feature(if has)
tmp = int(slots[
1]) + 2 # tmp == 2 + num_edges of current node
if tmp == len(slots):
# no node feature
nrow = [int(w) for w in slots]
nfeat = None
elif tmp < len(slots):
nrow = [int(w) for w in slots[:tmp]]
nfeat = [float(w) for w in slots[tmp:]]
node_features.append(nfeat)
else:
raise Exception('edge number is not correct!')
# relabel nodes if the dataset has node labels
# if it doesn't have node labels, then every nrow[0] == 0
if not nrow[0] in self.nlabel_dict:
mapped = len(self.nlabel_dict)
self.nlabel_dict[nrow[0]] = mapped
nlabels.append(self.nlabel_dict[nrow[0]])
num_edges += nrow[1]
edges.extend([(j, u) for u in nrow[2:]])
if self.self_loop:
num_edges += 1
edges.append((j, j))
if node_features != []:
node_features = np.stack(node_features)
graph['attr'] = node_features
self.nattrs_flag = True
else:
node_features = None
graph['attr'] = node_features
graph['nlabel'] = np.array(
nlabels, dtype="int64").reshape(-1, 1)
if len(self.nlabel_dict) > 1:
self.nlabels_flag = True
graph['edges'] = edges
assert num_edges == len(edges)
g = pgl.graph.Graph(
num_nodes=graph['num_nodes'],
edges=graph['edges'],
node_feat={
'nlabel': graph['nlabel'],
'attr': graph['attr']
})
self.graph_list.append(g)
# update statistics of graphs
self.n += graph['num_nodes']
self.m += num_edges
# if no attr
if not self.nattrs_flag:
log.info('there are no node features in this dataset!')
label2idx = {}
# generate node attr by node degree
if self.degree_as_nlabel:
log.info('generate node features by node degree...')
nlabel_set = set([])
for g in self.graph_list:
g.node_feat['nlabel'] = g.indegree()
# extracting unique node labels
nlabel_set = nlabel_set.union(set(g.node_feat['nlabel']))
g.node_feat['nlabel'] = g.node_feat['nlabel'].reshape(-1,
1)
nlabel_set = list(nlabel_set)
# in case the labels/degrees are not continuous number
self.ndegree_dict = {
nlabel_set[i]: i
for i in range(len(nlabel_set))
}
label2idx = self.ndegree_dict
# generate node attr by node label
else:
log.info('generate node features by node label...')
label2idx = self.nlabel_dict
for g in self.graph_list:
attr = np.zeros((g.num_nodes, len(label2idx)))
idx = [
label2idx[tag]
for tag in g.node_feat['nlabel'].reshape(-1, )
]
attr[np.arange(g.num_nodes), idx] = 1  # one-hot encode each node's label
g.node_feat['attr'] = attr.astype("float32")
# after load, get the #classes and #dim
self.gclasses = len(self.glabel_dict)
self.nclasses = len(self.nlabel_dict)
self.eclasses = len(self.elabel_dict)
self.dim_nfeats = len(self.graph_list[0].node_feat['attr'][0])
message = "finished loading data\n"
message += """
num_graph: %d
num_graph_class: %d
total_num_nodes: %d
node Classes: %d
node_features_dim: %d
num_edges: %d
edge_classes: %d
Avg. of #Nodes: %.2f
Avg. of #Edges: %.2f
Graph Relabeled: %s
Node Relabeled: %s
Degree Relabeled(If degree_as_nlabel=True): %s""" % (
self.num_graph,
self.gclasses,
self.n,
self.nclasses,
self.dim_nfeats,
self.m,
self.eclasses,
self.n / self.num_graph,
self.m / self.num_graph,
self.glabel_dict,
self.nlabel_dict,
self.ndegree_dict, )
log.info(message)
if __name__ == "__main__":
gindataset = GINDataset(
"./dataset/", "MUTAG", self_loop=True, degree_as_nlabel=False)
# Graph Isomorphism Network (GIN)
[Graph Isomorphism Network \(GIN\)](https://arxiv.org/pdf/1810.00826.pdf) is a simple graph neural network designed to be as discriminative as the Weisfeiler-Lehman graph isomorphism test. Based on PGL, we reproduce the GIN model.
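As a quick orientation before the full scripts, the sketch below stacks two `pgl.layers.conv.gin` layers inside a `GraphWrapper` program, mirroring the model code in this example; the node-feature name `attr`, its dimension (7) and the hidden size are placeholders, not values required by PGL.
```
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.layers.conv import gin

# A minimal sketch, assuming a float32 node feature named "attr" of dim 7.
prog, startup = fluid.Program(), fluid.Program()
with fluid.program_guard(prog, startup):
    gw = pgl.graph_wrapper.GraphWrapper(
        "gw", place=fluid.CPUPlace(),
        node_feat=[("attr", [None, 7], "float32")])
    h = gw.node_feat["attr"]
    for i in range(2):  # two GIN layers are enough for illustration
        h = gin(gw, h, hidden_size=64, activation="relu",
                name="gin_%s" % i, init_eps=0.0, train_eps=False)
        h = fl.batch_norm(h)
        h = fl.relu(h)
    graph_repr = pgl.layers.graph_pooling(gw, h, "sum")  # whole-graph readout
```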
### Datasets
The dataset can be downloaded from [here](https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip)
### Dependencies
- paddlepaddle 1.6
- pgl 1.0.2
### How to run
For example, to train the GIN model on the MUTAG dataset with a GPU:
```
python main.py --use_cuda --dataset_name MUTAG
```
### Hyperparameters
- data\_path: the root path of your dataset
- dataset\_name: the name of the dataset
- fold\_idx: Which fold of the 10-fold split to hold out for testing. We use 10-fold cross-validation here (a minimal split sketch follows this list).
- train\_eps: Whether the $\epsilon$ parameter is learnable.
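The split itself is standard stratified 10-fold cross-validation. The sketch below shows one way to produce such a split with `sklearn.model_selection.StratifiedKFold`; it only illustrates the idea and is not necessarily identical to the `fold10_split` helper imported in `main.py`.
```
import numpy as np
from sklearn.model_selection import StratifiedKFold

def ten_fold_indices(graph_labels, fold_idx=0, seed=0):
    """Return (train_idx, test_idx) for the fold_idx-th of 10 stratified folds."""
    labels = np.asarray(graph_labels)
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
    folds = list(skf.split(np.zeros((len(labels), 1)), labels))
    return folds[fold_idx]
```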
### Experiment results (Accuracy)
| |MUTAG | COLLAB | IMDBBINARY | IMDBMULTI |
|--|-------------|----------|------------|-----------------|
|PGL result | 90.8 | 78.6 | 76.8 | 50.8 |
|paper result |90.0 | 80.0 | 75.1 | 52.3 |
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the graph dataloader.
"""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals
import os
import sys
import time
import argparse
import numpy as np
import collections
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.utils import mp_reader
from pgl.utils.logger import log
def batch_iter(data, batch_size, fid, num_workers):
"""node_batch_iter
"""
size = len(data)
perm = np.arange(size)
np.random.shuffle(perm)
start = 0
cc = 0
while start < size:
index = perm[start:start + batch_size]
start += batch_size
cc += 1
if cc % num_workers != fid:
continue
yield data[index]
def scan_batch_iter(data, batch_size, fid, num_workers):
"""scan_batch_iter
"""
batch = []
cc = 0
for line_example in data.scan():
cc += 1
if cc % num_workers != fid:
continue
batch.append(line_example)
if len(batch) == batch_size:
yield batch
batch = []
if len(batch) > 0:
yield batch
class GraphDataloader(object):
"""Graph Dataloader
"""
def __init__(
self,
dataset,
batch_size,
seed=0,
num_workers=1,
buf_size=1000,
shuffle=True, ):
self.shuffle = shuffle
self.seed = seed
self.num_workers = num_workers
self.buf_size = buf_size
self.batch_size = batch_size
self.dataset = dataset
def batch_fn(self, batch_examples):
""" batch_fn batch producer"""
graphs = [b[0] for b in batch_examples]
labels = [b[1] for b in batch_examples]
join_graph = pgl.graph.MultiGraph(graphs)
labels = np.array(labels, dtype="int64").reshape(-1, 1)
return join_graph, labels
# feed_dict = self.graph_wrapper.to_feed(join_graph)
# raise NotImplementedError("No defined Batch Fn")
def batch_iter(self, fid):
"""batch_iter"""
if self.shuffle:
for batch in batch_iter(self, self.batch_size, fid,
self.num_workers):
yield batch
else:
for batch in scan_batch_iter(self, self.batch_size, fid,
self.num_workers):
yield batch
def __len__(self):
"""__len__"""
return len(self.dataset)
def __getitem__(self, idx):
"""__getitem__"""
if isinstance(idx, collections.Iterable):
return [self[bidx] for bidx in idx]
else:
return self.dataset[idx]
def __iter__(self):
"""__iter__"""
def worker(filter_id):
def func_run():
for batch_examples in self.batch_iter(filter_id):
batch_dict = self.batch_fn(batch_examples)
yield batch_dict
return func_run
if self.num_workers == 1:
r = paddle.reader.buffered(worker(0), self.buf_size)
else:
worker_pool = [worker(wid) for wid in range(self.num_workers)]
worker = mp_reader.multiprocess_reader(
worker_pool, use_pipe=True, queue_size=1000)
r = paddle.reader.buffered(worker, self.buf_size)
for batch in r():
yield batch
def scan(self):
"""scan"""
for example in self.dataset:
yield example
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the GIN model.
"""
import os
import sys
import time
import argparse
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.utils.logger import log
from Dataset import GINDataset, fold10_split, random_split
from dataloader import GraphDataloader
from model import GINModel
def main(args):
"""main function"""
dataset = GINDataset(
args.data_path,
args.dataset_name,
self_loop=not args.train_eps,
degree_as_nlabel=True)
train_dataset, test_dataset = fold10_split(
dataset, fold_idx=args.fold_idx, seed=args.seed)
train_loader = GraphDataloader(train_dataset, batch_size=args.batch_size)
test_loader = GraphDataloader(
test_dataset, batch_size=args.batch_size, shuffle=False)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
"gw", place=place, node_feat=dataset[0][0].node_feat_info())
model = GINModel(args, gw, dataset.gclasses)
model.forward()
infer_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
epoch_step = int(len(train_dataset) / args.batch_size) + 1
boundaries = [
i
for i in range(50 * epoch_step, args.epochs * epoch_step,
epoch_step * 50)
]
values = [args.lr * 0.5**i for i in range(0, len(boundaries) + 1)]
lr = fl.piecewise_decay(boundaries=boundaries, values=values)
train_op = fluid.optimizer.Adam(lr).minimize(model.loss)
exe = fluid.Executor(place)
exe.run(startup_program)
# train and evaluate
global_step = 0
for epoch in range(1, args.epochs + 1):
for idx, batch_data in enumerate(train_loader):
g, labels = batch_data
feed_dict = gw.to_feed(g)
feed_dict['labels'] = labels
ret_loss, ret_lr, ret_acc = exe.run(
train_program,
feed=feed_dict,
fetch_list=[model.loss, lr, model.acc])
global_step += 1
if global_step % 10 == 0:
message = "epoch %d | step %d | " % (epoch, global_step)
message += "lr %.6f | loss %.6f | acc %.4f" % (
ret_lr, ret_loss, ret_acc)
log.info(message)
# evaluate
result = evaluate(exe, infer_program, model, gw, test_loader)
message = "evaluating result"
for key, value in result.items():
message += " | %s %.6f" % (key, value)
log.info(message)
def evaluate(exe, prog, model, gw, loader):
"""evaluate"""
total_loss = []
total_acc = []
for idx, batch_data in enumerate(loader):
g, labels = batch_data
feed_dict = gw.to_feed(g)
feed_dict['labels'] = labels
ret_loss, ret_acc = exe.run(prog,
feed=feed_dict,
fetch_list=[model.loss, model.acc])
total_loss.append(ret_loss)
total_acc.append(ret_acc)
total_loss = np.mean(total_loss)
total_acc = np.mean(total_acc)
return {"loss": total_loss, "acc": total_acc}
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--data_path', type=str, default='./dataset')
parser.add_argument('--dataset_name', type=str, default='MUTAG')
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--fold_idx', type=int, default=0)
parser.add_argument('--output_path', type=str, default='./outputs/')
parser.add_argument('--use_cuda', action='store_true')
parser.add_argument('--num_layers', type=int, default=5)
parser.add_argument('--num_mlp_layers', type=int, default=2)
parser.add_argument('--hidden_size', type=int, default=64)
parser.add_argument(
'--pool_type',
type=str,
default="sum",
choices=["sum", "average", "max"])
parser.add_argument('--train_eps', action='store_true')
parser.add_argument('--epochs', type=int, default=350)
parser.add_argument('--lr', type=float, default=0.01)
parser.add_argument('--dropout_prob', type=float, default=0.5)
parser.add_argument('--seed', type=int, default=0)
args = parser.parse_args()
log.info(args)
if not os.path.exists(args.output_path):
os.makedirs(args.output_path)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This file implement the GIN model.
"""
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.layers.conv import gin
class GINModel(object):
"""GINModel"""
def __init__(self, args, gw, num_class):
self.args = args
self.num_layers = self.args.num_layers
self.hidden_size = self.args.hidden_size
self.train_eps = self.args.train_eps
self.pool_type = self.args.pool_type
self.dropout_prob = self.args.dropout_prob
self.num_class = num_class
self.gw = gw
self.labels = fl.data(name="labels", shape=[None, 1], dtype="int64")
def forward(self):
"""forward"""
features_list = [self.gw.node_feat["attr"]]
for i in range(self.num_layers):
h = gin(self.gw,
features_list[i],
hidden_size=self.hidden_size,
activation="relu",
name="gin_%s" % (i),
init_eps=0.0,
train_eps=self.train_eps)
h = fl.batch_norm(h)
h = fl.relu(h)
features_list.append(h)
output = 0
for i, h in enumerate(features_list):
pooled_h = pgl.layers.graph_pooling(self.gw, h, self.pool_type)
drop_h = fl.dropout(
pooled_h,
self.dropout_prob,
dropout_implementation="upscale_in_train")
output += fl.fc(drop_h,
size=self.num_class,
act=None,
param_attr=fluid.ParamAttr(name="final_fc_%s" %
(i)))
# calculate loss
self.loss = fl.softmax_with_cross_entropy(output, self.labels)
self.loss = fl.reduce_mean(self.loss)
self.acc = fl.accuracy(fl.softmax(output), self.labels)
# GraphSAGE in PGL
# GraphSAGE: Inductive Representation Learning on Large Graphs
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of accuracy as the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
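To make the sample-and-aggregate idea concrete, here is a NumPy sketch of a single GraphSAGE-mean step for one node over its sampled neighbors; the function and variable names are hypothetical and this is only an illustration, not the PGL implementation used below.
```
import numpy as np

def graphsage_mean_step(self_feat, neigh_feats, w_self, w_neigh):
    """One mean-aggregator step for a single node (illustration only).

    self_feat:   (d,)   feature of the target node
    neigh_feats: (k, d) features of its sampled neighbors
    w_self, w_neigh: (d, h) trainable projection matrices
    """
    neigh_mean = neigh_feats.mean(axis=0)
    h = self_feat @ w_self + neigh_mean @ w_neigh
    return np.maximum(h, 0.0)  # ReLU
```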
......@@ -12,16 +12,23 @@ The reddit dataset should be downloaded from the following links and placed in d
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### How to run
To train a GraphSAGE model on the Reddit dataset, run
```
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry
```
To train a GraphSAGE model with multiple GPUs, run
```
CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2
```
#### Hyperparameters
- epoch: Number of training epochs (default: 10)
......
......@@ -17,12 +17,15 @@ import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils import mp_reader
from pgl.utils.logger import log
import train
import time
import copy
def node_batch_iter(nodes, node_label, batch_size):
"""node_batch_iter
"""
perm = np.arange(len(nodes))
np.random.shuffle(perm)
start = 0
......@@ -33,6 +36,8 @@ def node_batch_iter(nodes, node_label, batch_size):
def traverse(item):
"""traverse
"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
......@@ -41,35 +46,56 @@ def traverse(item):
yield item
def flat_node_and_edge(nodes, eids):
def flat_node_and_edge(nodes):
"""flat_node_and_edge
"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
return nodes
def worker(batch_info, graph, samples):
def worker(batch_info, graph, graph_wrapper, samples):
"""Worker
"""
def work():
"""work
"""
_graph_wrapper = copy.copy(graph_wrapper)
_graph_wrapper.node_feat_tensor_dict = {}
for batch_train_samples, batch_train_labels in batch_info:
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
edges = []
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
pred_nodes = graph.sample_predecessor(
start_nodes, max_degree=max_deg)
for dst_node, src_nodes in zip(start_nodes, pred_nodes):
for src_node in src_nodes:
edges.append((src_node, dst_node))
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
nodes = [nodes, pred_nodes]
nodes = flat_node_and_edge(nodes)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
feed_dict = {}
feed_dict["nodes"] = [int(n) for n in nodes]
feed_dict["eids"] = [int(e) for e in eids]
feed_dict["node_label"] = [int(n) for n in batch_train_labels]
feed_dict["node_index"] = [int(n) for n in batch_train_samples]
subgraph = graph.subgraph(
nodes=nodes,
edges=edges,
with_node_feat=False,
with_edge_feat=False)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = _graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
feed_dict["parent_node_index"] = np.array(nodes, dtype="int64")
yield feed_dict
return work
......@@ -81,27 +107,31 @@ def multiprocess_graph_reader(graph,
node_index,
batch_size,
node_label,
with_parent_node_index=False,
num_workers=4):
def parse_to_subgraph(rd):
"""multiprocess_graph_reader
"""
def parse_to_subgraph(rd, prefix, node_feat, _with_parent_node_index):
"""parse_to_subgraph
"""
def work():
"""work
"""
for data in rd():
nodes = data["nodes"]
eids = data["eids"]
batch_train_labels = data["node_label"]
batch_train_samples = data["node_index"]
subgraph = graph.subgraph(nodes=nodes, eid=eids)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict = graph_wrapper.to_feed(subgraph)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = sub_node_index
feed_dict = data
for key in node_feat:
feed_dict[prefix + '/node_feat/' + key] = node_feat[key][
feed_dict["parent_node_index"]]
if not _with_parent_node_index:
del feed_dict["parent_node_index"]
yield feed_dict
return work
def reader():
"""reader"""
batch_info = list(
node_batch_iter(
node_index, node_label, batch_size=batch_size))
......@@ -110,44 +140,18 @@ def multiprocess_graph_reader(graph,
for i in range(num_workers):
reader_pool.append(
worker(batch_info[block_size * i:block_size * (i + 1)], graph,
samples))
multi_process_sample = paddle.reader.multiprocess_reader(
reader_pool, use_pipe=False)
r = parse_to_subgraph(multi_process_sample)
return paddle.reader.buffered(r, 1000)
graph_wrapper, samples))
if len(reader_pool) == 1:
r = parse_to_subgraph(reader_pool[0],
repr(graph_wrapper), graph.node_feat,
with_parent_node_index)
else:
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=True, queue_size=1000)
r = parse_to_subgraph(multi_process_sample,
repr(graph_wrapper), graph.node_feat,
with_parent_node_index)
return paddle.reader.buffered(r, num_workers)
return reader()
def graph_reader(graph, graph_wrapper, samples, node_index, batch_size,
node_label):
def reader():
for batch_train_samples, batch_train_labels in node_batch_iter(
node_index, node_label, batch_size=batch_size):
start_nodes = batch_train_samples
nodes = start_nodes
eids = []
for max_deg in samples:
pred, pred_eid = graph.sample_predecessor(
start_nodes, max_degree=max_deg, return_eids=True)
last_nodes = nodes
nodes = [nodes, pred]
eids = [eids, pred_eid]
nodes, eids = flat_node_and_edge(nodes, eids)
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if len(start_nodes) == 0:
break
subgraph = graph.subgraph(nodes=nodes, eid=eids)
feed_dict = graph_wrapper.to_feed(subgraph)
sub_node_index = subgraph.reindex_from_parrent_nodes(
batch_train_samples)
feed_dict["node_label"] = np.expand_dims(
np.array(
batch_train_labels, dtype="int64"), -1)
feed_dict["node_index"] = np.array(sub_node_index, dtype="int32")
yield feed_dict
return paddle.reader.buffered(reader, 1000)
......@@ -11,6 +11,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
......@@ -34,8 +35,9 @@ def load_data(normalize=True, symmetry=True):
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
......@@ -61,10 +63,7 @@ def load_data(normalize=True, symmetry=True):
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"index": np.arange(
0, len(feature), dtype="int32")})
num_nodes=feature.shape[0], edges=list(zip(src, dst)))
return {
"graph": graph,
......@@ -82,12 +81,18 @@ def load_data(normalize=True, symmetry=True):
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int32", append_batch_size=False)
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
feature = fluid.layers.gather(feature, graph_wrapper.node_feat['index'])
parent_node_index = fluid.layers.data(
"parent_node_index",
shape=[None],
dtype="int64",
append_batch_size=False)
feature = fluid.layers.gather(feature, parent_node_index)
feature.stop_gradient = True
for i in range(k_hop):
......@@ -97,28 +102,28 @@ def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s % i")
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s % i")
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
......@@ -198,7 +203,9 @@ def main(args):
hide_batch_size=False)
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph", place, node_feat=data['graph'].node_feat_info())
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
......@@ -217,59 +224,35 @@ def main(args):
exe.run(startup_program)
feature_init(place)
if args.sample_workers > 1:
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
else:
train_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
if args.sample_workers > 1:
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
else:
val_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
if args.sample_workers > 1:
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
else:
test_iter = reader.graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
with_parent_node_index=True,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
with_parent_node_index=True,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
with_parent_node_index=True,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import argparse
import time
import sys
import traceback
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def load_data(normalize=True, symmetry=True):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data_dir = os.path.dirname(os.path.abspath(__file__))
data = np.load(os.path.join(data_dir, "data/reddit.npz"))
adj = sp.load_npz(os.path.join(data_dir, "data/reddit_adj.npz"))
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row
dst = adj.col
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=list(zip(src, dst)),
node_feat={"feat": feature.astype("float32")})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size):
"""build_graph_model"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
feature = graph_wrapper.node_feat["feat"]
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s" % i)
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s" % i)
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s" % i)
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def to_multidevice(batch_iter, num_trainer):
"""to_multidevice"""
batch_dict = []
for batch in batch_iter():
batch_dict.append(batch)
if len(batch_dict) == num_trainer:
yield batch_dict
batch_dict = []
if len(batch_dict) > 0:
log.warning("The batch (%s) can't fill all device (%s)"
"which will be discarded." %
(len(batch_dict), num_trainer))
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100,
num_trainer=1):
"""run_epoch"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
if num_trainer > 1:
batch_iter = to_multidevice(batch_iter, num_trainer)
else:
batch_iter = batch_iter()
for batch_feed_dict in batch_iter:
batch += 1
if num_trainer > 1:
batch_loss, batch_acc = exe.run(
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
batch_loss = np.mean(batch_loss)
batch_acc = np.mean(batch_acc)
else:
batch_loss, batch_acc = exe.run(
program,
fetch_list=[model_loss.name, model_acc.name],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
if num_trainer > 1:
num_samples = sum(
[len(_batch["node_index"]) for _batch in batch_feed_dict])
else:
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""main"""
data = load_data(args.normalize, args.symmetry)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
if args.num_trainer > 1:
build_strategy = fluid.BuildStrategy()
build_strategy.remove_unnecessary_lock = False
build_strategy.enable_sequential_execution = True
train_exe = fluid.ParallelExecutor(
use_cuda=args.use_cuda,
main_program=train_program,
build_strategy=build_strategy,
loss_name=model_loss.name)
else:
train_exe = exe
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=train_exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
num_trainer=args.num_trainer,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--num_trainer", type=int, default=1)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Multi-GPU settings
"""
import argparse
import time
import numpy as np
import scipy.sparse as sp
from sklearn.preprocessing import StandardScaler
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import reader
from model import graphsage_mean, graphsage_meanpool,\
graphsage_maxpool, graphsage_lstm
def fixed_offset(data, num_nodes, scale):
"""Test
"""
len_data = len(data)
len_per_part = int(len_data / scale)
offset = np.arange(0, scale, dtype="int64")
offset = offset * num_nodes
offset = np.repeat(offset, len_per_part)
if len(data.shape) > 1:
data += offset.reshape([-1, 1])
else:
data += offset
def load_data(normalize=True, symmetry=True, scale=1):
"""
data from https://github.com/matenure/FastGCN/issues/8
reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt
reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
"""
data = np.load("data/reddit.npz")
adj = sp.load_npz("data/reddit_adj.npz")
if symmetry:
adj = adj + adj.T
adj = adj.tocoo()
src = adj.row.reshape([-1, 1])
dst = adj.col.reshape([-1, 1])
edges = np.hstack([src, dst])
num_class = 41
train_label = data['y_train']
val_label = data['y_val']
test_label = data['y_test']
train_index = data['train_index']
val_index = data['val_index']
test_index = data['test_index']
feature = data["feats"].astype("float32")
if normalize:
scaler = StandardScaler()
scaler.fit(feature[train_index])
feature = scaler.transform(feature)
if scale > 1:
num_nodes = feature.shape[0]
feature = np.tile(feature, [scale, 1])
train_label = np.tile(train_label, [scale])
val_label = np.tile(val_label, [scale])
test_label = np.tile(test_label, [scale])
edges = np.tile(edges, [scale, 1])
fixed_offset(edges, num_nodes, scale)
train_index = np.tile(train_index, [scale])
fixed_offset(train_index, num_nodes, scale)
val_index = np.tile(val_index, [scale])
fixed_offset(val_index, num_nodes, scale)
test_index = np.tile(test_index, [scale])
fixed_offset(test_index, num_nodes, scale)
log.info("Feature shape %s" % (repr(feature.shape)))
graph = pgl.graph.Graph(
num_nodes=feature.shape[0],
edges=edges,
node_feat={"feature": feature})
return {
"graph": graph,
"train_index": train_index,
"train_label": train_label,
"val_label": val_label,
"val_index": val_index,
"test_index": test_index,
"test_label": test_label,
"feature": feature,
"num_class": 41
}
def build_graph_model(graph_wrapper, num_class, k_hop, graphsage_type,
hidden_size, feature):
"""Test"""
node_index = fluid.layers.data(
"node_index", shape=[None], dtype="int64", append_batch_size=False)
node_label = fluid.layers.data(
"node_label", shape=[None, 1], dtype="int64", append_batch_size=False)
for i in range(k_hop):
if graphsage_type == 'graphsage_mean':
feature = graphsage_mean(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_mean_%s % i")
elif graphsage_type == 'graphsage_meanpool':
feature = graphsage_meanpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_meanpool_%s % i")
elif graphsage_type == 'graphsage_maxpool':
feature = graphsage_maxpool(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
elif graphsage_type == 'graphsage_lstm':
feature = graphsage_lstm(
graph_wrapper,
feature,
hidden_size,
act="relu",
name="graphsage_maxpool_%s % i")
else:
raise ValueError("graphsage type %s is not"
" implemented" % graphsage_type)
feature = fluid.layers.gather(feature, node_index)
logits = fluid.layers.fc(feature,
num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(
logits=logits, label=node_label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=node_label, k=1)
return loss, acc
def run_epoch(batch_iter,
exe,
program,
prefix,
model_loss,
model_acc,
epoch,
log_per_step=100):
"""Test"""
batch = 0
total_loss = 0.
total_acc = 0.
total_sample = 0
start = time.time()
for batch_feed_dict in batch_iter():
batch += 1
batch_loss, batch_acc = exe.run(program,
fetch_list=[model_loss, model_acc],
feed=batch_feed_dict)
if batch % log_per_step == 0:
log.info("Batch %s %s-Loss %s %s-Acc %s" %
(batch, prefix, batch_loss, prefix, batch_acc))
num_samples = len(batch_feed_dict["node_index"])
total_loss += batch_loss * num_samples
total_acc += batch_acc * num_samples
total_sample += num_samples
end = time.time()
log.info("%s Epoch %s Loss %.5lf Acc %.5lf Speed(per batch) %.5lf sec" %
(prefix, epoch, total_loss / total_sample,
total_acc / total_sample, (end - start) / batch))
def main(args):
"""Test """
data = load_data(args.normalize, args.symmetry, args.scale)
log.info("preprocess finish")
log.info("Train Examples: %s" % len(data["train_index"]))
log.info("Val Examples: %s" % len(data["val_index"]))
log.info("Test Examples: %s" % len(data["test_index"]))
log.info("Num nodes %s" % data["graph"].num_nodes)
log.info("Num edges %s" % data["graph"].num_edges)
log.info("Average Degree %s" % np.mean(data["graph"].indegree()))
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
samples = []
if args.samples_1 > 0:
samples.append(args.samples_1)
if args.samples_2 > 0:
samples.append(args.samples_2)
with fluid.program_guard(train_program, startup_program):
graph_wrapper = pgl.graph_wrapper.GraphWrapper(
"sub_graph",
fluid.CPUPlace(),
node_feat=data['graph'].node_feat_info())
model_loss, model_acc = build_graph_model(
graph_wrapper,
num_class=data["num_class"],
feature=graph_wrapper.node_feat["feature"],
hidden_size=args.hidden_size,
graphsage_type=args.graphsage_type,
k_hop=len(samples))
test_program = train_program.clone(for_test=True)
train_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['train_index'],
node_label=data["train_label"])
val_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['val_index'],
node_label=data["val_label"])
test_iter = reader.multiprocess_graph_reader(
data['graph'],
graph_wrapper,
samples=samples,
num_workers=args.sample_workers,
batch_size=args.batch_size,
node_index=data['test_index'],
node_label=data["test_label"])
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(model_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
for epoch in range(args.epoch):
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_loss=model_loss,
model_acc=model_acc,
epoch=epoch)
run_epoch(
val_iter,
program=test_program,
exe=exe,
prefix="val",
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
run_epoch(
test_iter,
program=test_program,
prefix="test",
exe=exe,
model_loss=model_loss,
model_acc=model_acc,
log_per_step=10000,
epoch=epoch)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--normalize", action='store_true', help="normalize features")
parser.add_argument(
"--symmetry", action='store_true', help="undirect graph")
parser.add_argument("--graphsage_type", type=str, default="graphsage_mean")
parser.add_argument("--sample_workers", type=int, default=5)
parser.add_argument("--epoch", type=int, default=10)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=128)
parser.add_argument("--lr", type=float, default=0.01)
parser.add_argument("--samples_1", type=int, default=25)
parser.add_argument("--samples_2", type=int, default=10)
parser.add_argument("--scale", type=int, default=1)
args = parser.parse_args()
log.info(args)
main(args)
# LINE: Large-scale Information Network Embedding
[LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable for a variety of networks with directed, undirected, binary or weighted edges. Based on PGL, we reproduce the LINE algorithm and reach the same level of accuracy as the paper.
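The training objective used later in `line.py` is a negative-sampling logistic loss over node pairs, `-mean(log sigmoid(label * <u_i, u_j>))`, where `label` is +1 for an observed edge and -1 for a sampled negative pair. The NumPy sketch below reproduces that loss for illustration only; the function name and inputs are hypothetical and it is not part of the training script.
```python
import numpy as np

def line_loss(emb_src, emb_dst, labels):
    """-mean(log sigmoid(label * <u_i, u_j>)) for a batch of node pairs."""
    inner = np.sum(emb_src * emb_dst, axis=1)
    # log(sigmoid(x)) == -logaddexp(0, -x); this form is numerically stable
    return np.mean(np.logaddexp(0.0, -labels * inner))
```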
## Datasets
[Flickr network](http://socialnetworks.mpi-sws.org/data-imc2007.html) is a social network, which contains 1715256 nodes and 22613981 edges.
You can download the data from [here](http://socialnetworks.mpi-sws.org/data-imc2007.html).
The Flickr network contains four files:
* flickr-groupmemberships.txt.gz
* flickr-groups.txt.gz
* flickr-links.txt.gz
* flickr-users.txt.gz
After downloading the data, uncompress it into, say, **./data/flickr/**. Note that the current directory is the root directory of the LINE model.
Then run the command below to preprocess the data.
```sh
python data_process.py
```
It will produce three files in the **./data/flickr/** directory:
* nodes.txt
* edges.txt
* nodes_label.txt
## Dependencies
- paddlepaddle>=1.6
- pgl
## How to run
For example, to train LINE on the Flickr dataset with a GPU:
```sh
# multiclass task example
python line.py --use_cuda --order first_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/
python multi_class.py --ckpt_path ./checkpoints/model/model_epoch_20 --percent 0.5
```
## Hyperparameters
- use_cuda: Use GPU if this flag is set.
- order: Train LINE with first-order or second-order proximity.
- percent: The percentage of nodes used as training data in the multi-class task.
## Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
Flickr|LINE with first_order|multi-label classification|MacroF1|0.626|0.627
Flickr|LINE with first_order|multi-label classification|MicroF1|0.637|0.639
Flickr|LINE with second_order|multi-label classification|MacroF1|0.615|0.621
Flickr|LINE with second_order|multi-label classification|MicroF1|0.630|0.635
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the dataset for the LINE model.
"""
import os
import io
import sys
import numpy as np
from pgl import graph
from pgl.utils.logger import log
class FlickrDataset(object):
"""Flickr dataset implementation
Args:
        data_path: The path of the dataset directory.
symmetry_edges: Whether to create symmetry edges.
self_loop: Whether to contain self loop edges.
        train_percentage: The percentage of nodes used for training in the multi-class task.
Attributes:
graph: The :code:`Graph` data object.
num_groups: Number of classes.
train_index: The index for nodes in training set.
        test_index: The index for nodes in the test set.
"""
def __init__(self,
data_path,
symmetry_edges=False,
self_loop=False,
train_percentage=0.5):
self.path = data_path
# self.name = name
self.num_groups = 5
self.symmetry_edges = symmetry_edges
self.self_loop = self_loop
self.train_percentage = train_percentage
self._load_data()
def _load_data(self):
edge_path = os.path.join(self.path, 'edges.txt')
node_path = os.path.join(self.path, 'nodes.txt')
nodes_label_path = os.path.join(self.path, 'nodes_label.txt')
all_edges = []
edges_weight = []
with io.open(node_path) as inf:
num_nodes = len(inf.readlines())
node_feature = np.zeros((num_nodes, self.num_groups))
with io.open(nodes_label_path) as inf:
for line in inf:
# group_id means the label of the node
node_id, group_id = line.strip('\n').split(',')
node_id = int(node_id) - 1
labels = group_id.split(' ')
for i in labels:
node_feature[node_id][int(i) - 1] = 1
node_degree_list = [1 for _ in range(num_nodes)]
with io.open(edge_path) as inf:
for line in inf:
items = line.strip().split('\t')
if len(items) == 2:
u, v = int(items[0]), int(items[1])
weight = 1 # binary weight, default set to 1
else:
u, v, weight = int(items[0]), int(items[1]), float(items[
2]),
u, v = u - 1, v - 1
all_edges.append((u, v))
edges_weight.append(weight)
if self.symmetry_edges:
all_edges.append((v, u))
edges_weight.append(weight)
# sum the weights of the same node as the outdegree
node_degree_list[u] += weight
if self.self_loop:
for i in range(num_nodes):
all_edges.append((i, i))
edges_weight.append(1.)
all_edges = list(set(all_edges))
self.graph = graph.Graph(
num_nodes=num_nodes,
edges=all_edges,
node_feat={"group_id": node_feature})
perm = np.arange(0, num_nodes)
np.random.shuffle(perm)
train_num = int(num_nodes * self.train_percentage)
self.train_index = perm[:train_num]
self.test_index = perm[train_num:]
edge_distribution = np.array(edges_weight, dtype=np.float32)
self.edge_distribution = edge_distribution / np.sum(edge_distribution)
        self.edge_sampling = AliasSampling(prob=self.edge_distribution)  # use the normalized distribution
node_dist = np.array(node_degree_list, dtype=np.float32)
node_negative_distribution = np.power(node_dist, 0.75)
self.node_negative_distribution = node_negative_distribution / np.sum(
node_negative_distribution)
        self.node_sampling = AliasSampling(prob=self.node_negative_distribution)  # use the normalized distribution
self.node_index = {}
self.node_index_reversed = {}
for index, e in enumerate(self.graph.edges):
self.node_index[e[0]] = index
self.node_index_reversed[index] = e[0]
def fetch_batch(self,
batch_size=16,
K=10,
edge_sampling='alias',
node_sampling='alias'):
"""Fetch batch data from dataset.
"""
if edge_sampling == 'numpy':
edge_batch_index = np.random.choice(
self.graph.num_edges,
size=batch_size,
p=self.edge_distribution)
elif edge_sampling == 'alias':
edge_batch_index = self.edge_sampling.sampling(batch_size)
elif edge_sampling == 'uniform':
edge_batch_index = np.random.randint(
0, self.graph.num_edges, size=batch_size)
u_i = []
u_j = []
label = []
for edge_index in edge_batch_index:
edge = self.graph.edges[edge_index]
u_i.append(edge[0])
u_j.append(edge[1])
label.append(1)
for i in range(K):
while True:
if node_sampling == 'numpy':
negative_node = np.random.choice(
self.graph.num_nodes,
p=self.node_negative_distribution)
elif node_sampling == 'alias':
negative_node = self.node_sampling.sampling()
elif node_sampling == 'uniform':
negative_node = np.random.randint(0,
self.graph.num_nodes)
# make sure the sampled node has no edge with the source node
if not self.graph.has_edges_between(
np.array(
[self.node_index_reversed[negative_node]]),
np.array([self.node_index_reversed[edge[0]]])):
break
u_i.append(edge[0])
u_j.append(negative_node)
label.append(-1)
u_i = np.array([u_i], dtype=np.int64).T
u_j = np.array([u_j], dtype=np.int64).T
label = np.array(label, dtype=np.float32)
return u_i, u_j, label
class AliasSampling:
"""Implemention of Alias-Method
This is an implementation of Alias-Method for sampling efficiently from
a discrete probability distribution.
Reference: https://en.wikipedia.org/wiki/Alias_method
Args:
prob: The discrete probability distribution.
"""
def __init__(self, prob):
self.n = len(prob)
self.U = np.array(prob) * self.n
self.K = [i for i in range(len(prob))]
overfull, underfull = [], []
for i, U_i in enumerate(self.U):
if U_i > 1:
overfull.append(i)
elif U_i < 1:
underfull.append(i)
while len(overfull) and len(underfull):
i, j = overfull.pop(), underfull.pop()
self.K[j] = i
self.U[i] = self.U[i] - (1 - self.U[j])
if self.U[i] > 1:
overfull.append(i)
elif self.U[i] < 1:
underfull.append(i)
def sampling(self, n=1):
"""Sampling.
"""
x = np.random.rand(n)
i = np.floor(self.n * x)
y = self.n * x - i
i = i.astype(np.int64)
res = [i[k] if y[k] < self.U[i[k]] else self.K[i[k]] for k in range(n)]
if n == 1:
return res[0]
else:
return res
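# Illustrative usage of AliasSampling (the probabilities below are hypothetical):
#   sampler = AliasSampling(prob=[0.5, 0.3, 0.2])
#   single_draw = sampler.sampling()     # one index, drawn in O(1)
#   batch_draws = sampler.sampling(n=8)  # a list of eight indices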
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file preprocesses the Flickr dataset for the LINE model.
"""
import argparse
import operator
import os
def process_data(groupsMemberships_file, flickr_links_file, users_label_file,
edges_file, users_file):
"""Preprocess flickr network dataset.
Args:
groupsMemberships_file: flickr-groupmemberships.txt file,
each line is a pair (user, group), which indicates a user belongs to a group.
flickr_links_file: flickr-links.txt file,
each line is a pair (user, user), which indicates
the two users have a relationship.
users_label_file: each line is a pair (user, list of group),
each user may belong to multiple groups.
edges_file: each line is a pair (user, user), which indicates
the two users have a relationship. It filters some unused edges.
        users_file: each line is an int number, which indicates the ID of a user.
"""
group2users = {}
with open(groupsMemberships_file, 'r') as f:
for line in f:
user, group = line.strip().split()
try:
group2users[int(group)].append(user)
except:
group2users[int(group)] = [user]
# counting how many users belong to every group
group2usersNum = {}
for key, item in group2users.items():
group2usersNum[key] = len(item)
groups_sorted_by_usersNum = sorted(
group2usersNum.items(), key=operator.itemgetter(1), reverse=True)
    # the paper only uses the 5 groups with the largest number of users
label = 1 # remapping the 5 groups from 1 to 5
users_label = {}
for i in range(5):
users_list = group2users[groups_sorted_by_usersNum[i][0]]
for user in users_list:
# one user may have multi-labels
try:
users_label[user].append(label)
except:
users_label[user] = [label]
label += 1
    # remap the user IDs so they run consecutively from 1 to N
userID2nodeID = {}
count = 1
for key in sorted(users_label.keys()):
userID2nodeID[key] = count
count += 1
with open(users_label_file, 'w') as writer:
for key in sorted(users_label.keys()):
line = ' '.join([str(i) for i in users_label[key]])
writer.write(str(userID2nodeID[key]) + ',' + line + '\n')
# produce edges file
with open(flickr_links_file, 'r') as reader, open(edges_file,
'w') as writer:
for line in reader:
src, dst = line.strip().split('\t')
# filter unused user IDs
if src in users_label and dst in users_label:
# remapping the users IDs
src = userID2nodeID[src]
dst = userID2nodeID[dst]
writer.write(str(src) + '\t' + str(dst) + '\n')
# produce nodes file
with open(users_file, 'w') as writer:
for i in range(1, 1 + len(userID2nodeID)):
writer.write(str(i) + '\n')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='LINE')
parser.add_argument(
'--groupmemberships',
type=str,
default='./data/flickr/flickr-groupmemberships.txt',
help='groupmemberships of flickr dataset')
parser.add_argument(
'--flickr_links',
type=str,
default='./data/flickr/flickr-links.txt',
help='the flickr-links.txt file for training')
parser.add_argument(
'--nodes_label',
type=str,
default='./data/flickr/nodes_label.txt',
help='nodes (users) label file for training')
parser.add_argument(
'--edges',
type=str,
default='./data/flickr/edges.txt',
help='the result edges (links) file for training')
parser.add_argument(
'--nodes',
type=str,
default='./data/flickr/nodes.txt',
help='the nodes (users) file for training')
args = parser.parse_args()
process_data(args.groupmemberships, args.flickr_links, args.nodes_label,
args.edges, args.nodes)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the LINE model.
"""
import time
import argparse
import random
import os
import numpy as np
import pgl
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from pgl.utils.logger import log
from data_loader import FlickrDataset
def make_dir(path):
"""Create directory if path is not existed.
Args:
path: The directory that wants to create.
"""
try:
os.makedirs(path)
except:
if not os.path.isdir(path):
raise
def save_param(dirname, var_name_list):
"""save_param"""
if not os.path.exists(dirname):
os.makedirs(dirname)
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def set_seed(seed):
"""Set global random seed.
"""
random.seed(seed)
np.random.seed(seed)
def build_model(args, graph):
"""Build LINE model.
Args:
        args: The hyperparameters used to configure the model.
graph: The :code:`Graph` data object.
"""
u_i = fl.data(
name='u_i', shape=[None, 1], dtype='int64', append_batch_size=False)
u_j = fl.data(
name='u_j', shape=[None, 1], dtype='int64', append_batch_size=False)
label = fl.data(
name='label', shape=[None], dtype='float32', append_batch_size=False)
lr = fl.data(
name='learning_rate',
shape=[1],
dtype='float32',
append_batch_size=False)
u_i_embed = fl.embedding(
input=u_i,
size=[graph.num_nodes, args.embed_dim],
param_attr='shared_w')
if args.order == 'first_order':
u_j_embed = fl.embedding(
input=u_j,
size=[graph.num_nodes, args.embed_dim],
param_attr='shared_w')
elif args.order == 'second_order':
u_j_embed = fl.embedding(
input=u_j,
size=[graph.num_nodes, args.embed_dim],
param_attr='context_w')
else:
raise ValueError("order should be first_order or second_order, not %s"
% (args.order))
inner_product = fl.reduce_sum(u_i_embed * u_j_embed, dim=1)
loss = -1 * fl.reduce_mean(fl.logsigmoid(label * inner_product))
optimizer = fluid.optimizer.RMSPropOptimizer(learning_rate=lr)
train_op = optimizer.minimize(loss)
return loss, optimizer
def main(args):
"""The main funciton for training LINE model.
"""
make_dir(args.save_dir)
set_seed(args.seed)
dataset = FlickrDataset(args.data_path)
log.info('num nodes in graph: %d' % dataset.graph.num_nodes)
log.info('num edges in graph: %d' % dataset.graph.num_edges)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
main_program = fluid.default_main_program()
startup_program = fluid.default_startup_program()
# build model here
with fluid.program_guard(main_program, startup_program):
loss, opt = build_model(args, dataset.graph)
exe = fluid.Executor(place)
exe.run(startup_program) #initialize the parameters of the network
batchrange = int(dataset.graph.num_edges / args.batch_size)
T = batchrange * args.epochs
for epoch in range(args.epochs):
for b in range(batchrange):
lr = max(args.lr * (1 - (batchrange * epoch + b) / T), 0.0001)
u_i, u_j, label = dataset.fetch_batch(
batch_size=args.batch_size,
K=args.neg_sample_size,
edge_sampling=args.sample_method,
node_sampling=args.sample_method)
feed_dict = {
'u_i': u_i,
'u_j': u_j,
'label': label,
'learning_rate': lr
}
ret_loss = exe.run(main_program,
feed=feed_dict,
fetch_list=[loss],
return_numpy=True)
if b % 500 == 0:
log.info("Epoch %d | Step %d | Loss %f | lr: %f" %
(epoch, b, ret_loss[0], lr))
# save parameters in every epoch
log.info("saving persistables parameters...")
cur_save_path = os.path.join(args.save_dir,
"model_epoch_%d" % (epoch + 1))
save_param(cur_save_path, ['shared_w'])
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='LINE')
parser.add_argument(
'--data_path',
type=str,
default='./data/flickr/',
help='dataset for training')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--epochs", type=int, default=20, help='total epochs')
parser.add_argument("--seed", type=int, default=1667, help='random seed')
parser.add_argument("--lr", type=float, default=0.01, help='learning rate')
parser.add_argument(
"--neg_sample_size",
type=int,
default=5,
help='negative sample number')
parser.add_argument("--save_dir", type=str, default="./checkpoints/model")
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument(
"--embed_dim",
type=int,
default=128,
help='the dimension of node embedding')
parser.add_argument(
"--sample_method",
type=str,
default="alias",
help='negative sample method (uniform, numpy, alias)')
parser.add_argument(
"--order",
type=str,
default="first_order",
help='the order of neighbors (first_order, second_order)')
args = parser.parse_args()
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the multi-class task for evaluating the embeddings
learned by the LINE model.
"""
import argparse
import time
import math
import os
import random
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
from pgl.utils import op
import paddle.fluid as fluid
import paddle.fluid.layers as l
from pgl.utils.logger import log
from data_loader import FlickrDataset
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def set_seed(seed):
"""Set global random seed.
"""
random.seed(seed)
np.random.seed(seed)
def node_classify_model(graph,
num_labels,
embed_dim=16,
name='node_classify_task'):
"""Build node classify model.
Args:
graph: The :code:`Graph` data object.
num_labels: The number of labels.
embed_dim: The dimension of embedding.
name: The name of the model.
"""
pyreader = l.py_reader(
capacity=70,
shapes=[[-1, 1], [-1, num_labels]],
dtypes=['int64', 'float32'],
lod_levels=[0, 0],
name=name + '_pyreader',
use_double_buffer=True)
nodes, labels = l.read_file(pyreader)
embed_nodes = l.embedding(
input=nodes, size=[graph.num_nodes, embed_dim], param_attr='shared_w')
embed_nodes.stop_gradient = True
logits = l.fc(input=embed_nodes, size=num_labels)
loss = l.sigmoid_cross_entropy_with_logits(logits, labels)
loss = l.reduce_mean(loss)
prob = l.sigmoid(logits)
topk = l.reduce_sum(labels, -1)
return {
'pyreader': pyreader,
'loss': loss,
'prob': prob,
'labels': labels,
'topk': topk
}
# return pyreader, loss, prob, labels, topk
def node_classify_generator(graph,
all_nodes=None,
batch_size=512,
epoch=1,
shuffle=True):
"""Data generator for node classify.
Args:
graph: The :code:`Graph` data object.
all_nodes: the nodes used for generating batches; defaults to all nodes in the graph.
batch_size: batch size for training.
epoch: The number of epochs.
shuffle: Random shuffle of data.
"""
if all_nodes is None:
all_nodes = np.arange(graph.num_nodes)
def batch_nodes_generator(shuffle=shuffle):
"""Batch nodes generator.
"""
perm = np.arange(len(all_nodes), dtype=np.int64)
if shuffle:
np.random.shuffle(perm)
start = 0
while start < len(all_nodes):
yield all_nodes[perm[start:start + batch_size]]
start += batch_size
def wrapper():
"""Wrapper function.
"""
for _ in range(epoch):
for batch_nodes in batch_nodes_generator():
batch_nodes_expanded = np.expand_dims(batch_nodes,
-1).astype(np.int64)
batch_labels = graph.node_feat['group_id'][batch_nodes].astype(
np.float32)
yield [batch_nodes_expanded, batch_labels]
return wrapper
def topk_f1_score(labels,
probs,
topk_list=None,
average="macro",
threshold=None):
"""Calculate top K F1 score.
"""
assert topk_list is not None or threshold is not None, "one of topk_list and threshold must not be None"
if threshold is not None:
preds = probs > threshold
else:
preds = np.zeros_like(labels, dtype=np.int64)
for idx, (prob, topk) in enumerate(zip(np.argsort(probs), topk_list)):
preds[idx][prob[-int(topk):]] = 1
return f1_score(labels, preds, average=average)
def main(args):
"""The main funciton for nodes classify task.
"""
set_seed(args.seed)
log.info(args)
dataset = FlickrDataset(args.data_path, train_percentage=args.percent)
train_steps = (len(dataset.train_index) // args.batch_size) * args.epochs
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
train_model = node_classify_model(
dataset.graph,
dataset.num_groups,
embed_dim=args.embed_dim,
name='train')
lr = l.polynomial_decay(args.lr, train_steps, 0.0001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(train_model['loss'])
with fluid.program_guard(test_prog, startup_prog):
with fluid.unique_name.guard():
test_model = node_classify_model(
dataset.graph,
dataset.num_groups,
embed_dim=args.embed_dim,
name='test')
test_prog = test_prog.clone(for_test=True)
exe = fluid.Executor(place)
exe.run(startup_prog)
train_model['pyreader'].decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.train_index,
batch_size=args.batch_size,
epoch=args.epochs))
test_model['pyreader'].decorate_tensor_provider(
node_classify_generator(
dataset.graph,
dataset.test_index,
batch_size=args.batch_size,
epoch=1))
def existed_params(var):
"""existed_params
"""
if not isinstance(var, fluid.framework.Parameter):
return False
return os.path.exists(os.path.join(args.ckpt_path, var.name))
log.info('loading pretrained parameters from npy')
load_param(args.ckpt_path, ['shared_w'])
step = 0
prev_time = time.time()
train_model['pyreader'].start()
final_macro_f1 = 0.0
final_micro_f1 = 0.0
while 1:
try:
train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run(
train_prog,
fetch_list=[
train_model['loss'], train_model['prob'],
train_model['labels'], train_model['topk']
],
return_numpy=True)
train_macro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "macro",
args.threshold)
train_micro_f1 = topk_f1_score(train_labels_val, train_probs_val,
train_topk_val, "micro",
args.threshold)
step += 1
log.info("Step %d " % step + "Train Loss: %f " % train_loss_val +
"Train Macro F1: %f " % train_macro_f1 +
"Train Micro F1: %f " % train_micro_f1)
except fluid.core.EOFException:
train_model['pyreader'].reset()
break
test_model['pyreader'].start()
test_probs_vals, test_labels_vals, test_topk_vals = [], [], []
while 1:
try:
test_loss_val, test_probs_val, test_labels_val, test_topk_val = exe.run(
test_prog,
fetch_list=[
test_model['loss'], test_model['prob'],
test_model['labels'], test_model['topk']
],
return_numpy=True)
test_probs_vals.append(test_probs_val)
test_labels_vals.append(test_labels_val)
test_topk_vals.append(test_topk_val)
except fluid.core.EOFException:
test_model['pyreader'].reset()
test_probs_array = np.concatenate(test_probs_vals)
test_labels_array = np.concatenate(test_labels_vals)
test_topk_array = np.concatenate(test_topk_vals)
test_macro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"macro", args.threshold)
test_micro_f1 = topk_f1_score(
test_labels_array, test_probs_array, test_topk_array,
"micro", args.threshold)
log.info("\t\tStep %d " % step + "Test Loss: %f " %
test_loss_val + "Test Macro F1: %f " % test_macro_f1 +
"Test Micro F1: %f " % test_micro_f1)
final_macro_f1 = max(test_macro_f1, final_macro_f1)
final_micro_f1 = max(test_micro_f1, final_micro_f1)
break
log.info("\nFinal test Macro F1: %f " % final_macro_f1 +
"Final test Micro F1: %f " % final_micro_f1)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='LINE')
parser.add_argument(
'--data_path',
type=str,
default='./data/flickr/',
help='dataset for training')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument("--epochs", type=int, default=5)
parser.add_argument("--seed", type=int, default=1667)
parser.add_argument(
"--lr", type=float, default=0.025, help='learning rate')
parser.add_argument("--embed_dim", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=256)
parser.add_argument("--threshold", type=float, default=None)
parser.add_argument(
"--percent",
type=float,
default=0.5,
help="the percentage of data as training data")
parser.add_argument(
"--ckpt_path", type=str, default="./checkpoints/model/model_epoch_0/")
args = parser.parse_args()
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file loads and preprocesses the dataset for the metapath2vec model.
"""
import sys
import os
import glob
import numpy as np
import tqdm
import time
import logging
import random
from pgl import heter_graph
import pickle as pkl
class Dataset(object):
"""Implementation of Dataset class
This is a simple implementation for loading and processing the dataset for the metapath2vec model.
Args:
config: dict, some configure parameters.
"""
NEGATIVE_TABLE_SIZE = 1e8
def __init__(self, config):
self.config = config
self.walk_files = os.path.join(config['input_path'],
config['walk_path'])
self.word2id_file = os.path.join(config['input_path'],
config['word2id_file'])
self.word2freq = {}
self.word2id = {}
self.id2word = {}
self.sentences_count = 0
self.token_count = 0
self.negatives = []
self.discards = []
logging.info('reading sentences')
self.read_words()
logging.info('initializing discards')
self.initDiscards()
logging.info('initializing negatives')
self.initNegatives()
def read_words(self):
"""Read words(nodes) from walk files which are produced by sampler.
"""
word_freq = dict()
for walk_file in glob.glob(self.walk_files):
with open(walk_file, 'r') as reader:
for walk in reader:
walk = walk.strip().split()
if len(walk) > 1:
self.sentences_count += 1
for word in walk:
if int(word) >= self.config[
'paper_start_index']: # remove paper
continue
else:
self.token_count += 1
word_freq[word] = word_freq.get(word, 0) + 1
wid = 0
logging.info('Read %d sentences.' % self.sentences_count)
logging.info('Read %d words.' % self.token_count)
logging.info('%d words have been sampled.' % len(word_freq))
for w, c in word_freq.items():
if c < self.config['min_count']:
continue
self.word2id[w] = wid
self.id2word[wid] = w
self.word2freq[wid] = c
wid += 1
self.word_count = len(self.word2id)
logging.info(
'%d words that appear fewer than %d (min_count) times have been discarded.' %
(len(word_freq) - len(self.word2id), self.config['min_count']))
pkl.dump(self.word2id, open(self.word2id_file, 'wb'))
def initDiscards(self):
"""Get a frequency table for sub-sampling.
"""
t = 0.0001
f = np.array(list(self.word2freq.values())) / self.token_count
self.discards = np.sqrt(t / f) + (t / f)
def initNegatives(self):
"""Get a table for negative sampling
"""
pow_freq = np.array(list(self.word2freq.values()))**0.75
words_pow = sum(pow_freq)
ratio = pow_freq / words_pow
count = np.round(ratio * Dataset.NEGATIVE_TABLE_SIZE)
for wid, c in enumerate(count):
self.negatives += [wid] * int(c)
self.negatives = np.array(self.negatives)
np.random.shuffle(self.negatives)
self.sampling_prob = ratio
def getNegatives(self, size):
"""Get negative samples from negative samling table.
"""
return np.random.choice(self.negatives, size)
def walk_from_files(self, walkpath_files):
"""Generate walks from files.
"""
bucket = []
for filename in walkpath_files:
with open(filename) as reader:
for line in reader:
words = line.strip().split()
words = [
w for w in words
if int(w) < self.config['paper_start_index']
]
if len(words) > 1:
word_ids = [
self.word2id[w] for w in words if w in self.word2id
]
bucket.append(word_ids)
if len(bucket) == self.config['batch_size']:
yield bucket
bucket = []
if len(bucket):
yield bucket
def pairs_generator(self, walkpath_files):
"""Generate train pairs(src, pos, negs) for training model.
"""
def wrapper():
"""wrapper for multiprocess calling.
"""
for walks in self.walk_from_files(walkpath_files):
res = self.gen_pairs(walks)
yield res
return wrapper
def gen_pairs(self, walks):
"""Generate train pairs data for training model.
"""
src = []
pos = []
negs = []
skip_window = self.config['win_size'] // 2
for walk in walks:
for i in range(len(walk)):
for j in range(1, skip_window + 1):
if i - j >= 0:
src.append(walk[i])
pos.append(walk[i - j])
negs.append(
self.getNegatives(size=self.config['neg_num']))
if i + j < len(walk):
src.append(walk[i])
pos.append(walk[i + j])
negs.append(
self.getNegatives(size=self.config['neg_num']))
src = np.array(src, dtype=np.int64).reshape(-1, 1, 1)
pos = np.array(pos, dtype=np.int64).reshape(-1, 1, 1)
negs = np.expand_dims(np.array(negs, dtype=np.int64), -1)
return {"src": src, "pos": pos, "negs": negs}
if __name__ == "__main__":
config = {
'input_path': './data/out_aminer_CPAPC/',
'walk_path': 'aminer_walks_CPAPC_500num_100len/*',
'author_label_file': 'author_label.txt',
'venue_label_file': 'venue_label.txt',
'remapping_author_label_file': 'multi_class_author_label.txt',
'remapping_venue_label_file': 'multi_class_venue_label.txt',
'word2id_file': 'word2id.pkl',
'win_size': 7,
'neg_num': 5,
'min_count': 2,
'batch_size': 1,
}
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level=getattr(logging, 'INFO'), format=log_format)
dataset = Dataset(config)
# metapath2vec: Scalable Representation Learning for Heterogeneous Networks
[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithmic framework for representation learning in heterogeneous networks, which contain multiple types of nodes and links. Given a heterogeneous graph, metapath2vec first generates meta-path-based random walks and then trains a skipgram model on them as a language model. Based on PGL, we reproduce the metapath2vec algorithm.
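To make the second stage concrete, here is a minimal NumPy sketch of how (src, pos, negs) skipgram training triples can be built from a single walk. It mirrors the pair construction in this example's `Dataset.gen_pairs`, except that, for brevity, negatives are drawn uniformly instead of from the degree^0.75 table; the function and its arguments are illustrative only.

```python
import numpy as np

def gen_skipgram_pairs(walk, win_size=5, neg_num=5, num_nodes=1000, seed=0):
    """Build (src, pos, negs) skipgram triples from one random walk.

    Negatives are sampled uniformly here for simplicity; the real Dataset
    class samples them from a unigram^0.75 table.
    """
    rng = np.random.default_rng(seed)
    skip_window = win_size // 2
    src, pos, negs = [], [], []
    for i, center in enumerate(walk):
        for j in range(1, skip_window + 1):
            for ctx in (i - j, i + j):
                if 0 <= ctx < len(walk):
                    src.append(center)          # center node
                    pos.append(walk[ctx])       # positive context node
                    negs.append(rng.integers(0, num_nodes, size=neg_num))
    return np.array(src), np.array(pos), np.array(negs)

src, pos, negs = gen_skipgram_pairs([3, 17, 42, 8, 3], win_size=5, neg_num=5)
print(src.shape, pos.shape, negs.shape)  # (14,) (14,) (14, 5)
```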
## Datasets
You can download the datasets from [here](https://ericdongyx.github.io/metapath2vec/m2v.html).
We use the "aminer" data as an example. After downloading the aminer data, put it in, say, ./data/net_aminer/. The "label/" directory also needs to be placed in ./data/.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All the hyperparameters are saved in the config.yaml file, so before training you can open config.yaml and modify them as you like.
For example, you can set "use_cuda" to "True" to train with GPU, or modify "data_path" to specify the data you want; a short snippet for editing these values programmatically is shown after the list below.
Some important hyperparameters in config.yaml:
- **use_cuda**: use GPU to train model
- **data_path**: the directory of dataset that you want to load
- **lr**: learning rate
- **neg_num**: number of negative samples.
- **num_walks**: number of walks started from each node
- **walk_length**: walk length
- **metapath**: meta path scheme
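As a hedged illustration (assuming the config.yaml layout shipped with this example), the hyperparameters can also be edited programmatically with PyYAML before launching sample.py or main.py:

```python
import yaml

# Load the configuration, switch training to GPU, point it at another dataset,
# and write it back. The keys follow the config.yaml used in this example.
with open('config.yaml') as f:
    config = yaml.safe_load(f)

config['use_cuda'] = True
config['sampler']['args']['data_path'] = './data/net_aminer/'

with open('config.yaml', 'w') as f:
    yaml.safe_dump(config, f, default_flow_style=False)
```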
## Metapath random walk sampling
Before training, we need to generate meta-path random walks for the skipgram model. Run the command below to produce the random walk data (a small PGL-based sketch of what this step does is shown after the command).
```sh
python sample.py -c config.yaml
```
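For reference, the sketch below shows roughly what sample.py does internally, using the same PGL APIs (`heter_graph.HeterGraph`, `pgl.sample.metapath_randomwalk`). The toy graph, node IDs and walk length are made up for illustration, and node features are omitted; it is not a replacement for the real sampler.

```python
import numpy as np
from pgl import heter_graph
from pgl.sample import metapath_randomwalk

# A toy heterogeneous graph: 2 conferences, 2 authors, 3 papers.
node_types = [(0, 'conf'), (1, 'conf'), (2, 'author'), (3, 'author'),
              (4, 'paper'), (5, 'paper'), (6, 'paper')]
p2c = [(4, 0), (5, 0), (6, 1)]
p2a = [(4, 2), (5, 3), (6, 2)]
edges = {
    'p2c': p2c,
    'c2p': [(d, s) for s, d in p2c],
    'p2a': p2a,
    'a2p': [(d, s) for s, d in p2a],
}
graph = heter_graph.HeterGraph(
    num_nodes=len(node_types), edges=edges, node_types=node_types)

# Start walks from conference nodes and follow the conf-paper-author-paper-conf
# metapath, just like the sampler configured by config.yaml.
for start_nodes in graph.node_batch_iter(batch_size=2, n_type='conf'):
    walks = metapath_randomwalk(
        graph=graph,
        start_nodes=start_nodes,
        metapath='c2p-p2a-a2p-p2c',
        walk_length=9)
    for walk in walks:
        print(walk)
```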
## Training and Testing
After the random walk sampling finishes, run the commands below to train and evaluate the model.
```sh
python main.py -c config.yaml
python multi_class.py --dataset ./data/out_aminer_CPAPC/author_label.txt --word2id ./checkpoints/train.metapath2vec/word2id.pkl --ckpt_path ./checkpoints/train.metapath2vec/model_epoch5/
```
## Experiment results
| train_percent | Metric | PGL Result | Reported Result |
|---------------|----------|------------|-----------------|
| 50% | macro-F1 | 0.9249 | 0.9314 |
| 50% | micro-F1 | 0.9283 | 0.9365 |
task_name: train.metapath2vec
use_cuda: True
log_level: info
seed: 1667
sampler:
type:
args:
data_path: ./data/net_aminer/
author_label_file: ./data/label/googlescholar.8area.author.label.txt
venue_label_file: ./data/label/googlescholar.8area.venue.label.txt
output_path: ./data/out_aminer_CPAPC/
new_author_label_file: author_label.txt
new_venue_label_file: venue_label.txt
walk_saved_path: walks/
walk_batch_size: 1000
num_walks: 1000
walk_length: 100
num_sample_workers: 16
first_node_type: conf
metapath: c2p-p2a-a2p-p2c #conf-paper-author-paper-conf
optimizer:
type: Adam
args:
lr: 0.005
end_lr: 0.0001
trainer:
type: trainer
args:
epochs: 5
log_dir: logs/
save_dir: checkpoints/
output_dir: outputs/
num_sample_workers: 8
data_loader:
type: Dataset
args:
input_path: ./data/out_aminer_CPAPC/ # same path as output_path in sampler
walk_path: walks/*
word2id_file: word2id.pkl
batch_size: 32
win_size: 5 # default: 7
neg_num: 5
min_count: 10
paper_start_index: 1697414
model:
type: SkipgramModel
args:
embed_dim: 128
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the metapath2vec model.
"""
import os
import sys
import random
import argparse
import time
import numpy as np
import logging
import pickle as pkl
import shutil
import glob
import pgl
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import paddle.fluid.layers as fl
from utils import *
import Dataset
import model as Models
from pgl.utils import mp_reader
from sklearn.metrics import (auc, f1_score, precision_recall_curve,
roc_auc_score)
def set_seed(seed):
"""Set global random seed."""
random.seed(seed)
np.random.seed(seed)
def save_param(dirname, var_name_list):
"""save_param"""
if not os.path.exists(dirname):
os.makedirs(dirname)
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor))
def multiprocess_data_generator(config, dataset):
"""Using multiprocess to generate training data.
"""
num_sample_workers = config['trainer']['args']['num_sample_workers']
walkpath_files = [[] for i in range(num_sample_workers)]
for idx, f in enumerate(glob.glob(dataset.walk_files)):
walkpath_files[idx % num_sample_workers].append(f)
gen_data_pool = [
dataset.pairs_generator(files) for files in walkpath_files
]
if num_sample_workers == 1:
gen_data_func = gen_data_pool[0]
else:
gen_data_func = mp_reader.multiprocess_reader(
gen_data_pool, use_pipe=True, queue_size=100)
return gen_data_func
def run_epoch(epoch,
config,
data_generator,
train_prog,
model,
feed_dict,
exe,
for_test=False):
"""Run training process of every epoch.
"""
total_loss = []
for idx, batch_data in enumerate(data_generator()):
feed_dict['train_inputs'] = batch_data['src']
feed_dict['train_labels'] = batch_data['pos']
feed_dict['train_negs'] = batch_data['negs']
loss, lr = exe.run(train_prog,
feed=feed_dict,
fetch_list=[model.loss, model.lr],
return_numpy=True)
total_loss.append(loss[0])
if (idx + 1) % 500 == 0:
avg_loss = np.mean(total_loss)
logging.info("epoch %d | step %d | lr %.4f | train_loss %f " %
(epoch, idx + 1, lr, avg_loss))
total_loss = []
def main(config):
"""main function for training metapath2vec model.
"""
logging.info(config)
set_seed(config['seed'])
dataset = getattr(
Dataset, config['data_loader']['type'])(config['data_loader']['args'])
data_generator = multiprocess_data_generator(config, dataset)
# move word2id file to checkpoints directory
src_word2id_file = dataset.word2id_file
dst_word2id_file = config['trainer']['args']['save_dir'] + config[
'data_loader']['args']['word2id_file']
logging.info('backup word2id file to %s' % dst_word2id_file)
shutil.move(src_word2id_file, dst_word2id_file)
place = fluid.CUDAPlace(0) if config['use_cuda'] else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
model = getattr(Models, config['model']['type'])(
dataset=dataset, config=config['model']['args'], place=place)
with fluid.program_guard(train_program, startup_program):
global_steps = int(dataset.sentences_count *
config['trainer']['args']['epochs'] /
config['data_loader']['args']['batch_size'])
model.backward(global_steps, config['optimizer']['args'])
# train
exe = fluid.Executor(place)
exe.run(startup_program)
feed_dict = {}
logging.info('training...')
for epoch in range(1, 1 + config['trainer']['args']['epochs']):
run_epoch(epoch, config['trainer']['args'], data_generator,
train_program, model, feed_dict, exe)
logging.info('saving model...')
cur_save_path = os.path.join(config['trainer']['args']['save_dir'],
"model_epoch%d" % (epoch))
save_param(cur_save_path, ['content'])
logging.info('finishing training')
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='metapath2vec')
parser.add_argument(
'-c',
'--config',
default=None,
type=str,
help='config file path (default: None)')
parser.add_argument(
'-n',
'--taskname',
default=None,
type=str,
help='task name(default: None)')
args = parser.parse_args()
if args.config:
# load config file
config = Config(args.config, isCreate=True, isSave=True)
config = config()
else:
raise AssertionError(
"Configuration file need to be specified. Add '-c config.yaml', for example."
)
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(
level=getattr(logging, config['log_level'].upper()), format=log_format)
main(config)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the skipgram model for training metapath2vec.
"""
import argparse
import time
import math
import os
import io
from multiprocessing import Pool
import logging
import numpy as np
import glob
import pgl
from pgl import data_loader
from pgl.utils import op
from pgl.utils.logger import log
import paddle.fluid as fluid
import paddle.fluid.layers as fl
class SkipgramModel(object):
"""Implemetation of skipgram model.
Args:
config: dict, some configure parameters.
dataset: instance of Dataset class
place: GPU or CPU place
"""
def __init__(self, config, dataset, place):
self.config = config
self.dataset = dataset
self.place = place
self.neg_num = self.dataset.config['neg_num']
self.num_nodes = len(dataset.word2id)
self.train_inputs = fl.data(
'train_inputs', shape=[None, 1, 1], dtype='int64')
self.train_labels = fl.data(
'train_labels', shape=[None, 1, 1], dtype='int64')
self.train_negs = fl.data(
'train_negs', shape=[None, self.neg_num, 1], dtype='int64')
self.forward()
def backward(self, global_steps, opt_config):
"""Build the optimizer.
"""
self.lr = fl.polynomial_decay(opt_config['lr'], global_steps,
opt_config['end_lr'])
adam = fluid.optimizer.Adam(learning_rate=self.lr)
adam.minimize(self.loss)
def forward(self):
"""Build the skipgram model.
"""
initrange = 1.0 / self.config['embed_dim']
embed_init = fluid.initializer.UniformInitializer(
low=-initrange, high=initrange)
weight_init = fluid.initializer.TruncatedNormal(
scale=1.0 / math.sqrt(self.config['embed_dim']))
embed_src = fl.embedding(
input=self.train_inputs,
size=[self.num_nodes, self.config['embed_dim']],
param_attr=fluid.ParamAttr(
name='content', initializer=embed_init))
weight_pos = fl.embedding(
input=self.train_labels,
size=[self.num_nodes, self.config['embed_dim']],
param_attr=fluid.ParamAttr(
name='weight', initializer=weight_init))
weight_negs = fl.embedding(
input=self.train_negs,
size=[self.num_nodes, self.config['embed_dim']],
param_attr=fluid.ParamAttr(
name='weight', initializer=weight_init))
pos_logits = fl.matmul(
embed_src, weight_pos, transpose_y=True) # [batch_size, 1, 1]
pos_score = fl.squeeze(pos_logits, axes=[1])
pos_score = fl.clip(pos_score, min=-10, max=10)
pos_score = -self.neg_num * fl.logsigmoid(pos_score)
neg_logits = fl.matmul(
embed_src, weight_negs,
transpose_y=True) # [batch_size, 1, neg_num]
neg_score = fl.squeeze(neg_logits, axes=[1])
neg_score = fl.clip(neg_score, min=-10, max=10)
neg_score = -1.0 * fl.logsigmoid(-1.0 * neg_score)
neg_score = fl.reduce_sum(neg_score, dim=1, keep_dim=True)
self.loss = fl.reduce_mean(pos_score + neg_score) / self.neg_num / 2
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file provides the multi-class task for evaluating the embeddings learned by the metapath2vec model.
"""
import argparse
import sys
import os
import tqdm
import time
import math
import logging
import random
import pickle as pkl
import numpy as np
import sklearn.metrics
from sklearn.metrics import f1_score
import pgl
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import Dataset
from utils import *
def load_param(dirname, var_name_list):
"""load_param"""
for var_name in var_name_list:
var = fluid.global_scope().find_var(var_name)
var_tensor = var.get_tensor()
var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
var_tensor.set(var_tmp, fluid.CPUPlace())
def load_data(file_, word2id):
"""Load data for node classification.
"""
words_label = []
line_count = 0
with open(file_, 'r') as reader:
for line in reader:
line_count += 1
tokens = line.strip().split(' ')
word, label = tokens[0], int(tokens[1]) - 1
if word in word2id:
words_label.append((word2id[word], label))
words_label = np.array(words_label, dtype=np.int64)
np.random.shuffle(words_label)
logging.info('%d/%d word_label pairs have been loaded' %
(len(words_label), line_count))
return words_label
def node_classify_model(word2id, num_labels, embed_dim=16):
"""Build node classify model.
Args:
word2id(dict): map word(node) to its corresponding index
num_labels: The number of labels.
embed_dim: The dimension of embedding.
"""
nodes = fl.data('nodes', shape=[None, 1], dtype='int64')
labels = fl.data('labels', shape=[None, 1], dtype='int64')
embed_nodes = fl.embedding(
input=nodes,
size=[len(word2id), embed_dim],
param_attr=fluid.ParamAttr(name='content'))
embed_nodes.stop_gradient = True
probs = fl.fc(input=embed_nodes, size=num_labels, act='softmax')
predict = fl.argmax(probs, axis=-1)
loss = fl.cross_entropy(input=probs, label=labels)
loss = fl.reduce_mean(loss)
return {
'loss': loss,
'probs': probs,
'predict': predict,
'labels': labels,
}
def run_epoch(exe, prog, model, feed_dict, lr):
"""Run training process of every epoch.
"""
if lr is None:
loss, predict = exe.run(prog,
feed=feed_dict,
fetch_list=[model['loss'], model['predict']],
return_numpy=True)
lr_ = 0
else:
loss, predict, lr_ = exe.run(
prog,
feed=feed_dict,
fetch_list=[model['loss'], model['predict'], lr],
return_numpy=True)
macro_f1 = f1_score(feed_dict['labels'], predict, average="macro")
micro_f1 = f1_score(feed_dict['labels'], predict, average="micro")
return {
'loss': loss,
'pred': predict,
'lr': lr_,
'macro_f1': macro_f1,
'micro_f1': micro_f1
}
def main(args):
"""main function for training node classification task.
"""
word2id = pkl.load(open(args.word2id, 'rb'))
words_label = load_data(args.dataset, word2id)
# split data for training and testing
split_position = int(words_label.shape[0] * args.train_percent)
train_words_label = words_label[0:split_position, :]
test_words_label = words_label[split_position:, :]
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_prog = fluid.Program()
test_prog = fluid.Program()
startup_prog = fluid.Program()
with fluid.program_guard(train_prog, startup_prog):
with fluid.unique_name.guard():
model = node_classify_model(
word2id, args.num_labels, embed_dim=args.embed_dim)
test_prog = train_prog.clone(for_test=True)
with fluid.program_guard(train_prog, startup_prog):
lr = fl.polynomial_decay(args.lr, 1000, 0.001)
adam = fluid.optimizer.Adam(lr)
adam.minimize(model['loss'])
exe = fluid.Executor(place)
exe.run(startup_prog)
load_param(args.ckpt_path, ['content'])
feed_dict = {}
X = train_words_label[:, 0].reshape(-1, 1)
labels = train_words_label[:, 1].reshape(-1, 1)
logging.info('%d/%d data to train' %
(labels.shape[0], words_label.shape[0]))
test_feed_dict = {}
test_X = test_words_label[:, 0].reshape(-1, 1)
test_labels = test_words_label[:, 1].reshape(-1, 1)
logging.info('%d/%d data to test' %
(test_labels.shape[0], words_label.shape[0]))
for epoch in range(args.epochs):
feed_dict['nodes'] = X
feed_dict['labels'] = labels
train_result = run_epoch(exe, train_prog, model, feed_dict, lr)
test_feed_dict['nodes'] = test_X
test_feed_dict['labels'] = test_labels
test_result = run_epoch(exe, test_prog, model, test_feed_dict, lr=None)
logging.info(
'epoch %d | lr %.4f | train_loss %.5f | train_macro_F1 %.4f | train_micro_F1 %.4f | test_loss %.5f | test_macro_F1 %.4f | test_micro_F1 %.4f'
% (epoch, train_result['lr'], train_result['loss'],
train_result['macro_f1'], train_result['micro_f1'],
test_result['loss'], test_result['macro_f1'],
test_result['micro_f1']))
logging.info(
'final_test_macro_f1 score: %.4f | final_test_micro_f1 score: %.4f' %
(test_result['macro_f1'], test_result['micro_f1']))
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='multi_class')
parser.add_argument(
'--dataset',
default=None,
type=str,
help='training and testing data file(default: None)')
parser.add_argument(
'--word2id',
default=None,
type=str,
help='word2id file (default: None)')
parser.add_argument(
'--ckpt_path', default=None, type=str, help='checkpoint path (default: None)')
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
'--train_percent',
default=0.5,
type=float,
help='train_percent(default: 0.5)')
parser.add_argument(
'--num_labels',
default=8,
type=int,
help='number of labels(default: 8)')
parser.add_argument(
'--epochs',
default=100,
type=int,
help='number of epochs for training (default: 100)')
parser.add_argument(
'--lr',
default=0.025,
type=float,
help='learning rate(default: 0.025)')
parser.add_argument(
'--embed_dim',
default=128,
type=int,
help='dimension of embedding(default: 128)')
args = parser.parse_args()
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level='INFO', format=log_format)
main(args)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the sampler that generates meta-path random walk sequences
for training the metapath2vec model.
"""
import multiprocessing
from multiprocessing import Pool
from multiprocessing import Process
import argparse
import sys
import os
import numpy as np
import pickle as pkl
import tqdm
import time
import logging
import random
from pgl import heter_graph
from pgl.sample import metapath_randomwalk
from utils import *
class Sampler(object):
"""Implemetation of sampler in order to sample metapath random walk.
Args:
config: dict, some configure parameters.
"""
def __init__(self, config):
self.config = config
self.build_graph()
def build_graph(self):
"""Build pgl heterogeneous graph.
"""
self.conf_id2index, self.conf_name2index, conf_node_type = self.remapping_id(
self.config['data_path'] + 'id_conf.txt',
start_index=0,
node_type='conf')
logging.info('%d venues have been loaded.' % (len(self.conf_id2index)))
self.author_id2index, self.author_name2index, author_node_type = self.remapping_id(
self.config['data_path'] + 'id_author.txt',
start_index=len(self.conf_id2index),
node_type='author')
logging.info('%d authors have been loaded.' %
(len(self.author_id2index)))
self.paper_id2index, self.paper_name2index, paper_node_type = self.remapping_id(
self.config['data_path'] + 'paper.txt',
start_index=(len(self.conf_id2index) + len(self.author_id2index)),
node_type='paper',
separator='\t')
logging.info('%d papers have been loaded.' %
(len(self.paper_id2index)))
node_types = conf_node_type + author_node_type + paper_node_type
num_nodes = len(node_types)
edges_by_types = {}
paper_author_edges = self.load_edges(
self.config['data_path'] + 'paper_author.txt', self.paper_id2index,
self.author_id2index)
paper_conf_edges = self.load_edges(
self.config['data_path'] + 'paper_conf.txt', self.paper_id2index,
self.conf_id2index)
# edges_by_types['edge'] = paper_author_edges + paper_conf_edges
edges_by_types['p2c'] = paper_conf_edges
edges_by_types['c2p'] = [(dst, src) for src, dst in paper_conf_edges]
edges_by_types['p2a'] = paper_author_edges
edges_by_types['a2p'] = [(dst, src) for src, dst in paper_author_edges]
# logging.info('%d edges have been loaded.' %
# (len(edges_by_types['edge'])))
node_features = {
'index': np.array([i for i in range(num_nodes)]).reshape(
-1, 1).astype(np.int64)
}
self.graph = heter_graph.HeterGraph(
num_nodes=num_nodes,
edges=edges_by_types,
node_types=node_types,
node_feat=node_features)
def remapping_id(self, file_, start_index, node_type, separator='\t'):
"""Mapp the ID and name of nodes to index.
"""
node_types = []
id2index = {}
name2index = {}
index = start_index
with open(file_, encoding="ISO-8859-1") as reader:
for line in reader:
tokens = line.strip().split(separator)
id2index[tokens[0]] = index
if len(tokens) == 2:
name2index[tokens[1]] = index
node_types.append((index, node_type))
index += 1
return id2index, name2index, node_types
def load_edges(self, file_, src2index, dst2index, symmetry=False):
"""Load edges from file.
"""
edges = []
with open(file_, 'r') as reader:
for line in reader:
items = line.strip().split()
src, dst = src2index[items[0]], dst2index[items[1]]
edges.append((src, dst))
if symmetry:
edges.append((dst, src))
edges = list(set(edges))
return edges
def generate_multi_class_data(self, name_label_file):
"""Mapp the data that will be used in multi class task to index.
"""
if 'author' in name_label_file:
name2index = self.author_name2index
else:
name2index = self.conf_name2index
index_label_list = []
with open(name_label_file, encoding="ISO-8859-1") as reader:
for line in reader:
tokens = line.strip().split(' ')
name, label = tokens[0], int(tokens[1])
index = name2index[name]
index_label_list.append((index, label))
return index_label_list
def walk_generator(graph, batch_size, metapath, n_type, walk_length):
"""Generate metapath random walk.
"""
np.random.seed(os.getpid())
while True:
for start_nodes in graph.node_batch_iter(
batch_size=batch_size, n_type=n_type):
walks = metapath_randomwalk(
graph=graph,
start_nodes=start_nodes,
metapath=metapath,
walk_length=walk_length)
yield walks
def walk_to_files(g, batch_size, metapath, n_type, walk_length, max_num,
filename):
"""Generate metapath randomwalk and save in files"""
# g, batch_size, metapath, n_type, walk_length, max_num, filename = args
with open(filename, 'w') as writer:
cc = 0
for walks in walk_generator(g, batch_size, metapath, n_type,
walk_length):
for walk in walks:
writer.write("%s\n" % "\t".join([str(i) for i in walk]))
cc += 1
if cc == max_num:
return
return
def multiprocess_generate_walks_to_files(graph, n_type, meta_path, num_walks,
walk_length, batch_size,
num_sample_workers, saved_path):
"""Use multiprocess to generate metapath random walk to files.
"""
num_nodes_by_type = graph.num_nodes_by_type(n_type)
logging.info("num_nodes_by_type: %s" % num_nodes_by_type)
max_num = (num_walks * num_nodes_by_type // num_sample_workers) + 1
logging.info("max sample number of every worker: %s" % max_num)
args = []
for i in range(num_sample_workers):
filename = os.path.join(saved_path, 'part-%05d' % (i))
args.append((graph, batch_size, meta_path, n_type, walk_length,
max_num, filename))
ps = []
for i in range(num_sample_workers):
p = Process(target=walk_to_files, args=args[i])
p.start()
ps.append(p)
for i in range(num_sample_workers):
ps[i].join()
# pool = Pool(num_sample_workers)
# pool.map(walk_to_files, args)
# pool.close()
# pool.join()
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='metapath2vec')
parser.add_argument(
'-c',
'--config',
default=None,
type=str,
help='config file path (default: None)')
args = parser.parse_args()
if args.config:
# load config file
config = Config(args.config, isCreate=False, isSave=False)
config = config()
config = config['sampler']['args']
else:
raise AssertionError(
"Configuration file need to be specified. Add '-c config.yaml', for example."
)
log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
logging.basicConfig(level="INFO", format=log_format)
logging.info(config)
if not os.path.exists(config['output_path']):
os.makedirs(config['output_path'])
config['walk_saved_path'] = config['output_path'] + config[
'walk_saved_path']
if not os.path.exists(config['walk_saved_path']):
os.makedirs(config['walk_saved_path'])
sampler = Sampler(config)
begin = time.time()
logging.info('multi process sampling')
multiprocess_generate_walks_to_files(
graph=sampler.graph,
n_type=config['first_node_type'],
meta_path=config['metapath'],
num_walks=config['num_walks'],
walk_length=config['walk_length'],
batch_size=config['walk_batch_size'],
num_sample_workers=config['num_sample_workers'],
saved_path=config['walk_saved_path'], )
logging.info('total time: %.4f' % (time.time() - begin))
logging.info('generating multi class data')
word_label_list = sampler.generate_multi_class_data(config[
'author_label_file'])
with open(config['output_path'] + config['new_author_label_file'],
'w') as writer:
for line in word_label_list:
line = [str(i) for i in line]
writer.write(' '.join(line) + '\n')
word_label_list = sampler.generate_multi_class_data(config[
'venue_label_file'])
with open(config['output_path'] + config['new_venue_label_file'],
'w') as writer:
for line in word_label_list:
line = [str(i) for i in line]
writer.write(' '.join(line) + '\n')
logging.info('finished')
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements a class for model configuration.
"""
import datetime
import os
import yaml
import random
import shutil
class Config(object):
"""Implementation of Config class for model configure.
Args:
config_file(str): configure filename, which is a yaml file.
isCreate(bool): if true, create the necessary directories for saving models, logs and other outputs.
isSave(bool): if true, save the config file to record the configuration.
"""
def __init__(self, config_file, isCreate=False, isSave=False):
self.config_file = config_file
self.config = self.get_config_from_yaml(config_file)
if isCreate:
self.create_necessary_dirs()
if isSave:
self.save_config_file()
def get_config_from_yaml(self, yaml_file):
"""Get the configure hyperparameters from yaml file.
"""
try:
with open(yaml_file, 'r') as f:
config = yaml.load(f, Loader=yaml.FullLoader)
except Exception:
raise IOError("Error in parsing config file '%s'" % yaml_file)
return config
def create_necessary_dirs(self):
"""Create some necessary directories to save some important files.
"""
time_stamp = datetime.datetime.now().strftime('%m%d_%H%M')
self.config['trainer']['args']['log_dir'] = ''.join(
(self.config['trainer']['args']['log_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['save_dir'] = ''.join(
(self.config['trainer']['args']['save_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
self.config['trainer']['args']['output_dir'] = ''.join(
(self.config['trainer']['args']['output_dir'],
self.config['task_name'], '/')) # , '.%s/' % (time_stamp)))
# if os.path.exists(self.config['trainer']['args']['save_dir']):
# input('save_dir is existed, do you really want to continue?')
self.make_dir(self.config['trainer']['args']['log_dir'])
self.make_dir(self.config['trainer']['args']['save_dir'])
self.make_dir(self.config['trainer']['args']['output_dir'])
def save_config_file(self):
"""Save config file so that we can know the config when we look back
"""
filename = self.config_file.split('/')[-1]
targetpath = self.config['trainer']['args']['save_dir']
shutil.copyfile(self.config_file, targetpath + filename)
def make_dir(self, path):
"""Build directory"""
if not os.path.exists(path):
os.makedirs(path)
def __getitem__(self, key):
"""Return the configure dict"""
return self.config[key]
def __call__(self):
"""__call__"""
return self.config
# PGL Examples for node2vec
# node2vec: Scalable Feature Learning for Networks
[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representation learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the node2vec algorithm and reach the same level of performance as reported in the paper.
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
......
# PGL - Knowledge Graph Embedding
This package is mainly for computing node and relation embeddings of knowledge graphs efficiently.
It reproduces the following knowledge graph embedding models (a minimal TransE scoring sketch follows the list):
- TransE
- TransR
- RotatE
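As a rough, hedged illustration of what these models optimize (a NumPy sketch, not the PGL-KG implementation), TransE scores a triple (h, r, t) as plausible when the translated head h + r lies close to the tail t:

```python
import numpy as np

def transe_score(head, relation, tail, norm=1):
    """TransE plausibility score ||h + r - t||; lower means more plausible."""
    return np.linalg.norm(head + relation - tail, ord=norm, axis=-1)

rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))      # toy 8-dimensional embeddings
print(transe_score(h, r, t))           # score of a random (likely corrupted) triple
print(transe_score(h, r, h + r))       # a perfectly translated triple scores 0.0
```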
### Dataset
The WN18 and FB15k datasets were originally published with the TransE paper and can be downloaded [here](https://everest.hds.utc.fr/doku.php?id=en:transe).
FB15k: [https://drive.google.com/open?id=19I3LqaKjgq-3vOs0us7OgEL06TIs37W8](https://drive.google.com/open?id=19I3LqaKjgq-3vOs0us7OgEL06TIs37W8)
WN18: [https://drive.google.com/open?id=1MXy257ZsjeXQHZScHLeQeVnUTPjltlwD](https://drive.google.com/open?id=1MXy257ZsjeXQHZScHLeQeVnUTPjltlwD)
### Dependencies
If you want to use PGL-KG with Paddle, please install the following packages.
- paddlepaddle>=1.7
- pgl
### Hyperparameters
- use\_cuda: use CUDA to train.
- model: the PGL-KG model name. `TransE`, `TransR` and `RotatE` are currently available.
- data\_dir: the path of the dataset.
- optimizer: the optimizer used to train the model.
- batch\_size: batch size.
- learning\_rate: learning rate.
- epoch: number of epochs to run.
- evaluate\_per\_iteration: evaluate every given number of epochs.
- sample\_workers: number of sampling workers used to prepare data.
- margin: the margin hyperparameter used by some models.
For more hyperparameter usage, please refer to `main.py`. We also provide a `run.sh` script to reproduce the performance results (please download the datasets into `./data` and specify the data\_dir parameter).
### How to run
For example, to train the TransR model on the WN18 dataset with GPU
(please download the WN18 dataset to the `./data` folder first):
```
python main.py --use_cuda --model TransR --data_dir ./data/WN18
```
The `run.sh` script reproduces the following performance results.
### Experiment results
Here we report the experiment results on the FB15k and WN18 datasets. The evaluation metrics are MR (mean rank), MRR (mean reciprocal rank) and Hits@N (the proportion of correct entities ranked in the top N). The suffix `@f` denotes the filtered setting, in which triples already present in the dataset are filtered out before ranking. A short sketch of how these metrics are derived from the per-triple ranks is given below the tables.
FB15k dataset
| Models | MR | MRR | Hits@1 | Hits@3 | Hits@10 | MR@f | MRR@f | Hits@1@f | Hits@3@f | Hits@10@f |
|--------|----|-----|--------|--------|---------|------|-------|----------|----------|-----------|
| TransE | 215 | 0.205 | 0.093 | 0.234 | 0.446 | 74 |0.379| 0.235| 0.453| 0.647 |
| TransR | 304 | 0.193 | 0.092 | 0.211 | 0.418 | 156 |0.366| 0.232| 0.435| 0.623 |
| RotatE | 157 | 0.270 | 0.162 | 0.303 | 0.501 | 53 |0.478| 0.354| 0.547| 0.710 |
WN18 dataset
| Models | MR | MRR | Hits@1 | Hits@3 | Hits@10 | MR@f | MRR@f | Hits@1@f | Hits@3@f | Hits@10@f |
|--------|----|-----|--------|--------|---------|------|-------|----------|----------|-----------|
| TransE | 219 | 0.338 | 0.082 | 0.523 | 0.800 | 208 |0.463| 0.135| 0.771| 0.932 |
| TransR | 321 | 0.370 | 0.096 | 0.591 | 0.810 | 309 |0.513| 0.158| 0.941| 0.941 |
| RotatE | 167 | 0.623 | 0.476 | 0.688 | 0.830 | 155 |0.915| 0.884| 0.941| 0.957 |
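The sketch below (hedged; the `ranks` array is made up) shows how the reported metrics are derived from the per-triple ranks produced by the evaluator in this package:

```python
import numpy as np

ranks = np.array([1, 3, 2, 15, 1, 120, 7])   # hypothetical filtered ranks

print("MR :", ranks.mean())                  # mean rank
print("MRR:", (1.0 / ranks).mean())          # mean reciprocal rank
for k in (1, 3, 10):
    print("Hits@%d: %.4f" % (k, (ranks <= k).mean()))
```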
## References
[1]. [TransE: Translating embeddings for modeling multi-relational data.](https://ieeexplore.ieee.org/abstract/document/8047276)
[2]. [TransR: Learning entity and relation embeddings for knowledge graph completion.](http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/viewFile/9571/9523)
[3]. [RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space.](https://arxiv.org/abs/1902.10197)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Loader for the knowledge graph dataset.
"""
import os
import numpy as np
from collections import defaultdict
from pgl.utils.logger import log
#from pybloom import BloomFilter
class KGLoader:
"""
Load a knowledge graph dataset such as FB15k or WN18.
"""
def __init__(self, data_dir, batch_size, neg_mode, neg_times):
"""init"""
self.name = os.path.split(data_dir)[-1]
self._feed_list = ["pos_triple", "neg_triple"]
self._data_dir = data_dir
self._batch_size = batch_size
self._neg_mode = neg_mode
self._neg_times = neg_times
self._entity2id = {}
self._relation2id = {}
self.training_triple_pool = set()
self._triple_train = None
self._triple_test = None
self._triple_valid = None
self.entity_total = 0
self.relation_total = 0
self.train_num = 0
self.test_num = 0
self.valid_num = 0
self.load_data()
def test_data_batch(self, batch_size=None):
"""
Test data reader.
:param batch_size: Todo: batch_size > 1.
:return: None
"""
for i in range(self.test_num):
data = np.array(self._triple_test[i])
data = data.reshape((-1))
yield [data]
def training_data_no_filter(self, train_triple_positive):
"""faster, no filter for exists triples"""
size = len(train_triple_positive) * self._neg_times
train_triple_negative = train_triple_positive.repeat(
self._neg_times, axis=0)
replace_head_probability = 0.5 * np.ones(size)
replace_entity_id = np.random.randint(self.entity_total, size=size)
random_num = np.random.random(size=size)
# index_t == 1 corrupts the head entity; index_t == 0 corrupts the tail entity
index_t = (random_num < replace_head_probability) * 1
train_triple_negative[:, 0] = train_triple_negative[:, 0] + (
replace_entity_id - train_triple_negative[:, 0]) * index_t
train_triple_negative[:, 2] = replace_entity_id + (
train_triple_negative[:, 2] - replace_entity_id) * index_t
train_triple_positive = np.expand_dims(train_triple_positive, axis=2)
train_triple_negative = np.expand_dims(train_triple_negative, axis=2)
return train_triple_positive, train_triple_negative
def training_data_map(self, train_triple_positive):
"""
Map function for negative sampling.
:param train_triple_positive: the positive triples.
:return: the positive and negative triples.
"""
size = len(train_triple_positive)
train_triple_negative = []
for i in range(size):
corrupt_head_prob = np.random.binomial(1, 0.5)
head_neg = train_triple_positive[i][0]
relation = train_triple_positive[i][1]
tail_neg = train_triple_positive[i][2]
for j in range(0, self._neg_times):
sample = train_triple_positive[i] + 0
while True:
rand_id = np.random.randint(self.entity_total)
if corrupt_head_prob:
if (rand_id, relation, tail_neg
) not in self.training_triple_pool:
sample[0] = rand_id
train_triple_negative.append(sample)
break
else:
if (head_neg, relation, rand_id
) not in self.training_triple_pool:
sample[2] = rand_id
train_triple_negative.append(sample)
break
train_triple_positive = np.expand_dims(train_triple_positive, axis=2)
train_triple_negative = np.expand_dims(train_triple_negative, axis=2)
if self._neg_mode:
return train_triple_positive, train_triple_negative, np.array(
[corrupt_head_prob], dtype="float32")
return train_triple_positive, train_triple_negative
def training_data_batch(self):
"""
Batch iterator over positive training triples.
:return: batches of positive triples.
"""
n = len(self._triple_train)
rand_idx = np.random.permutation(n)
n_triple = len(rand_idx)
start = 0
while start < n_triple:
end = min(start + self._batch_size, n_triple)
train_triple_positive = self._triple_train[rand_idx[start:end]]
start = end
yield train_triple_positive
def load_kg_triple(self, file):
"""
Read in kg files.
"""
triples = []
with open(os.path.join(self._data_dir, file), "r") as f:
for line in f.readlines():
line_list = line.strip().split('\t')
assert len(line_list) == 3
head = self._entity2id[line_list[0]]
tail = self._entity2id[line_list[1]]
relation = self._relation2id[line_list[2]]
triples.append((head, relation, tail))
return np.array(triples)
def load_data(self):
"""
load kg dataset.
"""
log.info("Start loading the {} dataset".format(self.name))
with open(os.path.join(self._data_dir, 'entity2id.txt'), "r") as f:
for line in f.readlines():
line = line.strip().split('\t')
self._entity2id[line[0]] = int(line[1])
with open(os.path.join(self._data_dir, 'relation2id.txt'), "r") as f:
for line in f.readlines():
line = line.strip().split('\t')
self._relation2id[line[0]] = int(line[1])
self._triple_train = self.load_kg_triple('train.txt')
self._triple_test = self.load_kg_triple('test.txt')
self._triple_valid = self.load_kg_triple('valid.txt')
self.relation_total = len(self._relation2id)
self.entity_total = len(self._entity2id)
self.train_num = len(self._triple_train)
self.test_num = len(self._triple_test)
self.valid_num = len(self._triple_valid)
#bloom_capacity = len(self._triple_train) + len(self._triple_test) + len(self._triple_valid)
#self.training_triple_pool = BloomFilter(capacity=bloom_capacity, error_rate=0.01)
for i in range(len(self._triple_train)):
self.training_triple_pool.add(
(self._triple_train[i, 0], self._triple_train[i, 1],
self._triple_train[i, 2]))
for i in range(len(self._triple_test)):
self.training_triple_pool.add(
(self._triple_test[i, 0], self._triple_test[i, 1],
self._triple_test[i, 2]))
for i in range(len(self._triple_valid)):
self.training_triple_pool.add(
(self._triple_valid[i, 0], self._triple_valid[i, 1],
self._triple_valid[i, 2]))
log.info('entity number: {}'.format(self.entity_total))
log.info('relation number: {}'.format(self.relation_total))
log.info('training triple number: {}'.format(self.train_num))
log.info('testing triple number: {}'.format(self.test_num))
log.info('valid triple number: {}'.format(self.valid_num))
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Evaluator for the results of knowledge graph embeddings.
"""
import numpy as np
import timeit
from mp_mapper import mp_reader_mapper
from pgl.utils.logger import log
class Evaluate:
"""
Evaluate for trained models.
"""
def __init__(self, reader):
self.reader = reader
self.training_triple_pool = self.reader.training_triple_pool
@staticmethod
def rank_extract(results, training_triple_pool):
"""
:param results: the scores of test examples.
:param training_triple_pool: existing edges.
:return: the ranks.
"""
eval_triple, head_score, tail_score = results
head_order = np.argsort(head_score)
tail_order = np.argsort(tail_score)
head, relation, tail = eval_triple[0], eval_triple[1], eval_triple[2]
head_rank_raw = 1
tail_rank_raw = 1
head_rank_filter = 1
tail_rank_filter = 1
for candidate in head_order:
if candidate == head:
break
else:
head_rank_raw += 1
if (candidate, relation, tail) in training_triple_pool:
continue
else:
head_rank_filter += 1
for candidate in tail_order:
if candidate == tail:
break
else:
tail_rank_raw += 1
if (head, relation, candidate) in training_triple_pool:
continue
else:
tail_rank_filter += 1
return head_rank_raw, tail_rank_raw, head_rank_filter, tail_rank_filter
def launch_evaluation(self,
exe,
program,
reader,
fetch_list,
num_workers=4):
"""
launch_evaluation
:param exe: executor.
:param program: paddle program.
:param reader: test reader.
:param fetch_list: fetch list.
:param num_workers: num of workers.
:return: None
"""
def func(training_triple_pool):
"""func"""
def run_func(results):
"""run_func"""
return self.rank_extract(results, training_triple_pool)
return run_func
def iterator():
"""iterator"""
n_used_eval_triple = 0
start = timeit.default_timer()
for batch_feed_dict in reader():
head_score, tail_score = exe.run(program=program,
fetch_list=fetch_list,
feed=batch_feed_dict)
yield batch_feed_dict["test_triple"], head_score, tail_score
n_used_eval_triple += 1
if n_used_eval_triple % 500 == 0:
print('[{:.3f}s] #evaluation triple: {}/{}'.format(
timeit.default_timer(
) - start, n_used_eval_triple, self.reader.test_num))
res_reader = mp_reader_mapper(
reader=iterator,
func=func(self.training_triple_pool),
num_works=num_workers)
self.result(res_reader)
@staticmethod
def result(rank_result_iter):
"""
Calculate the final results.
:param rank_result_iter: results iter.
:return: None
"""
all_rank = [[], []]
for data in rank_result_iter():
for i in range(4):
all_rank[i // 2].append(data[i])
raw_rank = np.array(all_rank[0])
filter_rank = np.array(all_rank[1])
log.info("-----Raw-Average-Results")
log.info(
'MeanRank: {:.2f}, MRR: {:.4f}, Hits@1: {:.4f}, Hits@3: {:.4f}, Hits@10: {:.4f}'.
format(raw_rank.mean(), (1 / raw_rank).mean(), (raw_rank <= 1).
mean(), (raw_rank <= 3).mean(), (raw_rank <= 10).mean()))
log.info("-----Filter-Average-Results")
log.info(
'MeanRank: {:.2f}, MRR: {:.4f}, Hits@1: {:.4f}, Hits@3: {:.4f}, Hits@10: {:.4f}'.
format(filter_rank.mean(), (1 / filter_rank).mean(), (
filter_rank <= 1).mean(), (filter_rank <= 3).mean(), (
filter_rank <= 10).mean()))
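For reference, the MeanRank / MRR / Hits@k numbers logged above can be reproduced with plain NumPy. The sketch below is illustrative only; the rank values are made up.

```
# Minimal sketch of the metric formulas used in `Evaluate.result`.
import numpy as np

# Made-up filtered ranks of the correct entity for five test triples.
ranks = np.array([1, 3, 7, 42, 2], dtype="float64")

mean_rank = ranks.mean()                              # MeanRank
mrr = (1.0 / ranks).mean()                            # Mean Reciprocal Rank
hits = {k: (ranks <= k).mean() for k in (1, 3, 10)}   # Hits@k

print("MeanRank: {:.2f}, MRR: {:.4f}".format(mean_rank, mrr))
print("Hits@1: {:.4f}, Hits@3: {:.4f}, Hits@10: {:.4f}".format(
    hits[1], hits[3], hits[10]))
```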
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
The script to run these models.
"""
import argparse
import timeit
import os
import numpy as np
import paddle.fluid as fluid
from data_loader import KGLoader
from evalutate import Evaluate
from model import model_dict
from model.utils import load_var
from mp_mapper import mp_reader_mapper
from pgl.utils.logger import log
def run_round(batch_iter,
program,
exe,
fetch_list,
epoch,
prefix="train",
log_per_step=1000):
"""
Run the program for one epoch.
:param batch_iter: the batch_iter of prepared data.
:param program: the running program, train_program or test program.
:param exe: the executor of paddle.
:param fetch_list: the variables to fetch.
:param epoch: the epoch number of train process.
:param prefix: the prefix name, type `string`.
:param log_per_step: log per step.
:return: None
"""
batch = 0
tmp_epoch = 0
loss = 0
tmp_loss = 0
run_time = 0
data_time = 0
t2 = timeit.default_timer()
start_epoch_time = timeit.default_timer()
for batch_feed_dict in batch_iter():
batch += 1
t1 = timeit.default_timer()
data_time += (t1 - t2)
batch_fetch = exe.run(program,
fetch_list=fetch_list,
feed=batch_feed_dict)
if prefix == "train":
loss += batch_fetch[0]
tmp_loss += batch_fetch[0]
if batch % log_per_step == 0:
tmp_epoch += 1
if prefix == "train":
log.info("Epoch %s (%.7f sec) Train Loss: %.7f" %
(epoch + tmp_epoch,
timeit.default_timer() - start_epoch_time,
tmp_loss[0] / batch))
start_epoch_time = timeit.default_timer()
else:
log.info("Batch %s" % batch)
batch = 0
tmp_loss = 0
t2 = timeit.default_timer()
run_time += (t2 - t1)
if prefix == "train":
log.info("GPU run time {}, Data prepare extra time {}".format(
run_time, data_time))
log.info("Epoch %s \t All Loss %s" % (epoch + tmp_epoch, loss))
def train(args):
"""
Train the knowledge graph embedding model.
:param args: all args.
:return: None
"""
kgreader = KGLoader(
batch_size=args.batch_size,
data_dir=args.data_dir,
neg_mode=args.neg_mode,
neg_times=args.neg_times)
if args.model in model_dict:
Model = model_dict[args.model]
else:
raise ValueError("No model for name {}".format(args.model))
model = Model(
data_reader=kgreader,
hidden_size=args.hidden_size,
margin=args.margin,
learning_rate=args.learning_rate,
args=args,
optimizer=args.optimizer)
def iter_map_wrapper(data_batch, repeat=1):
"""
wrapper for multiprocess reader
:param data_batch: the source data iter.
:param repeat: repeat data for multi epoch
:return: iterator of feed data
"""
def data_repeat():
"""repeat data for multi epoch"""
for i in range(repeat):
for d in data_batch():
yield d
reader = mp_reader_mapper(
data_repeat,
func=kgreader.training_data_no_filter
if args.nofilter else kgreader.training_data_map,
num_works=args.sample_workers)
return reader
def iter_wrapper(data_batch, feed_list):
"""
Wrapper that builds the feed dict.
:param data_batch: the source data iter.
:param feed_list: the feed list (names of variables).
:return: iterator of feed data.
"""
def work():
"""work"""
for batch in data_batch():
feed_dict = {}
for k, v in zip(feed_list, batch):
feed_dict[k] = v
yield feed_dict
return work
loader = fluid.io.DataLoader.from_generator(
feed_list=model.train_feed_vars, capacity=20, iterable=True)
places = fluid.cuda_places() if args.use_cuda else fluid.cpu_places()
exe = fluid.Executor(places[0])
exe.run(model.startup_program)
exe.run(fluid.default_startup_program())
if args.pretrain and model.model_name in ["TransR", "transr"]:
pretrain_ent = os.path.join(args.checkpoint,
model.ent_name.replace("TransR", "TransE"))
pretrain_rel = os.path.join(args.checkpoint,
model.rel_name.replace("TransR", "TransE"))
if os.path.exists(pretrain_ent):
print("loading pretrain!")
#var = fluid.global_scope().find_var(model.ent_name)
load_var(exe, model.train_program, model.ent_name, pretrain_ent)
#var = fluid.global_scope().find_var(model.rel_name)
load_var(exe, model.train_program, model.rel_name, pretrain_rel)
else:
raise ValueError("pretrain file {} not exists!".format(
pretrain_ent))
prog = fluid.CompiledProgram(model.train_program).with_data_parallel(
loss_name=model.train_fetch_vars[0].name)
if args.only_evaluate:
s = timeit.default_timer()
fluid.io.load_params(
exe, dirname=args.checkpoint, main_program=model.train_program)
Evaluate(kgreader).launch_evaluation(
exe=exe,
reader=iter_wrapper(kgreader.test_data_batch,
model.test_feed_list),
fetch_list=model.test_fetch_vars,
program=model.test_program,
num_workers=10)
log.info(timeit.default_timer() - s)
return None
batch_iter = iter_map_wrapper(
kgreader.training_data_batch,
repeat=args.evaluate_per_iteration, )
loader.set_batch_generator(batch_iter, places=places)
for epoch in range(0, args.epoch // args.evaluate_per_iteration):
run_round(
batch_iter=loader,
exe=exe,
prefix="train",
# program=model.train_program,
program=prog,
fetch_list=model.train_fetch_vars,
log_per_step=kgreader.train_num // args.batch_size,
epoch=epoch * args.evaluate_per_iteration)
log.info("epoch\t%s" % ((1 + epoch) * args.evaluate_per_iteration))
fluid.io.save_params(
exe, dirname=args.checkpoint, main_program=model.train_program)
if not args.noeval:
eva = Evaluate(kgreader)
eva.launch_evaluation(
exe=exe,
reader=iter_wrapper(kgreader.test_data_batch,
model.test_feed_list),
fetch_list=model.test_fetch_vars,
program=model.test_program,
num_workers=10)
def main():
"""
The main entry of all.
:return: None
"""
parser = argparse.ArgumentParser(
description="Knowledge Graph Embedding for PGL")
parser.add_argument('--use_cuda', action='store_true', help="use_cuda")
parser.add_argument(
'--data_dir',
dest='data_dir',
type=str,
help='the directory of dataset',
default='./data/WN18/')
parser.add_argument(
'--model',
dest='model',
type=str,
help="model to run",
default="TransE")
parser.add_argument(
'--learning_rate',
dest='learning_rate',
type=float,
help='learning rate',
default=0.001)
parser.add_argument(
'--epoch', dest='epoch', type=int, help='epoch to run', default=400)
parser.add_argument(
'--sample_workers',
dest='sample_workers',
type=int,
help='sample workers',
default=4)
parser.add_argument(
'--batch_size',
dest='batch_size',
type=int,
help="batch size",
default=1000)
parser.add_argument(
'--optimizer',
dest='optimizer',
type=str,
help='optimizer',
default='adam')
parser.add_argument(
'--hidden_size',
dest='hidden_size',
type=int,
help='embedding dimension',
default=50)
parser.add_argument(
'--margin', dest='margin', type=float, help='margin', default=4.0)
parser.add_argument(
'--checkpoint',
dest='checkpoint',
type=str,
help='directory to save checkpoints',
default='output/')
parser.add_argument(
'--evaluate_per_iteration',
dest='evaluate_per_iteration',
type=int,
help='evaluate the training result every x iterations',
default=50)
parser.add_argument(
'--only_evaluate',
dest='only_evaluate',
action='store_true',
help='only run the evaluation program',
default=False)
parser.add_argument(
'--adv_temp_value', type=float, help='adv_temp_value', default=2.0)
parser.add_argument('--neg_times', type=int, help='neg_times', default=1)
parser.add_argument(
'--neg_mode', type=bool, help='return neg mode flag', default=False)
parser.add_argument(
'--nofilter',
type=bool,
help='don\'t filter invalid examples',
default=False)
parser.add_argument(
'--pretrain',
type=bool,
help='load pretrained TransE embeddings for the TransR model',
default=False)
parser.add_argument(
'--noeval',
type=bool,
help='skip evaluation after training',
default=False)
args = parser.parse_args()
log.info(args)
train(args)
if __name__ == '__main__':
main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Base model of the knowledge graph embedding model.
"""
from paddle import fluid
class Model(object):
"""
Base model.
"""
def __init__(self, **kwargs):
"""
Init model
"""
# Needed parameters
self.model_name = kwargs["model_name"]
self.data_reader = kwargs["data_reader"]
self._hidden_size = kwargs["hidden_size"]
self._learning_rate = kwargs["learning_rate"]
self._optimizer = kwargs["optimizer"]
self.args = kwargs["args"]
# Optional parameters
if "margin" in kwargs:
self._margin = kwargs["margin"]
self._prefix = "%s_%s_dim=%d_" % (
self.model_name, self.data_reader.name, self._hidden_size)
self.ent_name = self._prefix + "entity_embeddings"
self.rel_name = self._prefix + "relation_embeddings"
self._entity_total = self.data_reader.entity_total
self._relation_total = self.data_reader.relation_total
self._ent_shape = [self._entity_total, self._hidden_size]
self._rel_shape = [self._relation_total, self._hidden_size]
def construct(self):
"""
Construct the program
:return: None
"""
self.startup_program = fluid.Program()
self.train_program = fluid.Program()
self.test_program = fluid.Program()
with fluid.program_guard(self.train_program, self.startup_program):
self.train_pos_input = fluid.layers.data(
"pos_triple",
dtype="int64",
shape=[None, 3, 1],
append_batch_size=False)
self.train_neg_input = fluid.layers.data(
"neg_triple",
dtype="int64",
shape=[None, 3, 1],
append_batch_size=False)
self.train_feed_list = ["pos_triple", "neg_triple"]
self.train_feed_vars = [self.train_pos_input, self.train_neg_input]
self.train_fetch_vars = self.construct_train_program()
loss = self.train_fetch_vars[0]
self.apply_optimizer(loss, opt=self._optimizer)
with fluid.program_guard(self.test_program, self.startup_program):
self.test_input = fluid.layers.data(
"test_triple",
dtype="int64",
shape=[3],
append_batch_size=False)
self.test_feed_list = ["test_triple"]
self.test_fetch_vars = self.construct_test_program()
def apply_optimizer(self, loss, opt="sgd"):
"""
Construct the backward of the train program.
:param loss: `type : variable` final loss of the model.
:param opt: `type : string` the optimizer name
:return:
"""
optimizer_available = {
"adam": fluid.optimizer.Adam,
"sgd": fluid.optimizer.SGD,
"momentum": fluid.optimizer.Momentum
}
if opt in optimizer_available:
opt_func = optimizer_available[opt]
else:
opt_func = None
if opt_func is None:
raise ValueError("You should chose the optimizer in %s" %
optimizer_available.keys())
else:
optimizer = opt_func(learning_rate=self._learning_rate)
return optimizer.minimize(loss)
def construct_train_program(self):
"""
This function should construct the train program with the `self.train_pos_input`
and `self.train_neg_input`. These inputs are batch of triples.
:return: List of variables to fetch. Make sure the loss variable comes
first, e.g. [loss, variable1, variable2, ...].
"""
raise NotImplementedError(
"You should define the construct_train_program"
" function before use it!")
def construct_test_program(self):
"""
This function should construct the test (or evaluate) program with `self.test_input`.
For now, only a single triple at a time is supported when evaluating the ranks.
:return: the distances of all entities to the test triple (for both the head and the tail entity).
"""
raise NotImplementedError(
"You should define the construct_test_program"
" function before use it")
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
RotatE:
"RotatE: Knowledge Graph Embedding by Relational Rotation in Complex Space."
Sun, Zhiqing, et al.
https://arxiv.org/abs/1902.10197
"""
import paddle.fluid as fluid
from .Model import Model
from .utils import lookup_table
class RotatE(Model):
"""
RotatE model.
"""
def __init__(self,
data_reader,
hidden_size,
margin,
learning_rate,
args,
optimizer="adam"):
super(RotatE, self).__init__(
model_name="RotatE",
data_reader=data_reader,
hidden_size=hidden_size,
margin=margin,
learning_rate=learning_rate,
args=args,
optimizer=optimizer)
self._neg_times = self.args.neg_times
self._adv_temp_value = self.args.adv_temp_value
self._relation_hidden_size = self._hidden_size
self._entity_hidden_size = self._hidden_size * 2
self._entity_embedding_margin = (
self._margin + 2) / self._entity_hidden_size
self._relation_embedding_margin = (
self._margin + 2) / self._relation_hidden_size
self._rel_shape = [self._relation_total, self._relation_hidden_size]
self._ent_shape = [self._entity_total, self._entity_hidden_size]
self._pi = 3.141592654
self.construct_program()
def construct_program(self):
"""
construct the main program for train and test
"""
self.startup_program = fluid.Program()
self.train_program = fluid.Program()
self.test_program = fluid.Program()
with fluid.program_guard(self.train_program, self.startup_program):
self.train_pos_input = fluid.layers.data(
"pos_triple",
dtype="int64",
shape=[None, 3, 1],
append_batch_size=False)
self.train_neg_input = fluid.layers.data(
"neg_triple",
dtype="int64",
shape=[None, 3, 1],
append_batch_size=False)
self.train_neg_mode = fluid.layers.data(
"neg_mode",
dtype='float32',
shape=[1],
append_batch_size=False)
self.train_feed_vars = [
self.train_pos_input, self.train_neg_input, self.train_neg_mode
]
self.train_fetch_vars = self.construct_train_program()
loss = self.train_fetch_vars[0]
self.apply_optimizer(loss, opt=self._optimizer)
with fluid.program_guard(self.test_program, self.startup_program):
self.test_input = fluid.layers.data(
"test_triple",
dtype="int64",
shape=[3],
append_batch_size=False)
self.test_feed_list = ["test_triple"]
self.test_fetch_vars = self.construct_test_program()
def creat_share_variables(self):
"""
Share variables for train and test programs.
"""
entity_embedding = fluid.layers.create_parameter(
shape=self._ent_shape,
dtype="float32",
name=self.ent_name,
default_initializer=fluid.initializer.Uniform(
low=-1.0 * self._entity_embedding_margin,
high=1.0 * self._entity_embedding_margin))
relation_embedding = fluid.layers.create_parameter(
shape=self._rel_shape,
dtype="float32",
name=self.rel_name,
default_initializer=fluid.initializer.Uniform(
low=-1.0 * self._relation_embedding_margin,
high=1.0 * self._relation_embedding_margin))
return entity_embedding, relation_embedding
def score_with_l2_normalize(self, head, tail, rel, epsilon_var,
train_neg_mode):
"""
Score function of RotatE
"""
one_var = fluid.layers.fill_constant(
shape=[1], dtype='float32', value=1.0)
re_head, im_head = fluid.layers.split(head, num_or_sections=2, dim=-1)
re_tail, im_tail = fluid.layers.split(tail, num_or_sections=2, dim=-1)
phase_relation = rel / (self._relation_embedding_margin / self._pi)
re_relation = fluid.layers.cos(phase_relation)
im_relation = fluid.layers.sin(phase_relation)
re_score = re_relation * re_tail + im_relation * im_tail
im_score = re_relation * im_tail - im_relation * re_tail
re_score = re_score - re_head
im_score = im_score - im_head
#with fluid.layers.control_flow.Switch() as switch:
# with switch.case(train_neg_mode == one_var):
# re_score = re_relation * re_tail + im_relation * im_tail
# im_score = re_relation * im_tail - im_relation * re_tail
# re_score = re_score - re_head
# im_score = im_score - im_head
# with switch.default():
# re_score = re_head * re_relation - im_head * im_relation
# im_score = re_head * im_relation + im_head * re_relation
# re_score = re_score - re_tail
# im_score = im_score - im_tail
re_score = re_score * re_score
im_score = im_score * im_score
score = re_score + im_score
score = score + epsilon_var
score = fluid.layers.sqrt(score)
score = fluid.layers.reduce_sum(score, dim=-1)
return self._margin - score
def adverarial_weight(self, score):
"""
Compute the self-adversarial weights via a softmax over the negative scores.
"""
adv_score = self._adv_temp_value * score
adv_softmax = fluid.layers.softmax(adv_score)
return adv_softmax
def construct_train_program(self):
"""
Construct train program
"""
zero_var = fluid.layers.fill_constant(
shape=[1], dtype='float32', value=0.0)
epsilon_var = fluid.layers.fill_constant(
shape=[1], dtype='float32', value=1e-12)
entity_embedding, relation_embedding = self.creat_share_variables()
pos_head = lookup_table(self.train_pos_input[:, 0], entity_embedding)
pos_tail = lookup_table(self.train_pos_input[:, 2], entity_embedding)
pos_rel = lookup_table(self.train_pos_input[:, 1], relation_embedding)
neg_head = lookup_table(self.train_neg_input[:, 0], entity_embedding)
neg_tail = lookup_table(self.train_neg_input[:, 2], entity_embedding)
neg_rel = lookup_table(self.train_neg_input[:, 1], relation_embedding)
pos_score = self.score_with_l2_normalize(pos_head, pos_tail, pos_rel,
epsilon_var, zero_var)
neg_score = self.score_with_l2_normalize(
neg_head, neg_tail, neg_rel, epsilon_var, self.train_neg_mode)
neg_score = fluid.layers.reshape(
neg_score, shape=[-1, self._neg_times], inplace=True)
if self._adv_temp_value > 0.0:
sigmoid_pos_score = fluid.layers.logsigmoid(1.0 * pos_score)
sigmoid_neg_score = fluid.layers.logsigmoid(
-1.0 * neg_score) * self.adverarial_weight(neg_score)
sigmoid_neg_score = fluid.layers.reduce_sum(
sigmoid_neg_score, dim=-1)
else:
sigmoid_pos_score = fluid.layers.logsigmoid(pos_score)
sigmoid_neg_score = fluid.layers.logsigmoid(-1.0 * neg_score)
loss_1 = fluid.layers.mean(sigmoid_pos_score)
loss_2 = fluid.layers.mean(sigmoid_neg_score)
loss = -1.0 * (loss_1 + loss_2) / 2
return [loss]
def score_with_l2_normalize_with_validate(self, entity_embedding, head,
rel, tail, epsilon_var):
"""
the score function for validation
"""
re_entity_embedding, im_entity_embedding = fluid.layers.split(
entity_embedding, num_or_sections=2, dim=-1)
re_head, im_head = fluid.layers.split(head, num_or_sections=2, dim=-1)
re_tail, im_tail = fluid.layers.split(tail, num_or_sections=2, dim=-1)
phase_relation = rel / (self._relation_embedding_margin / self._pi)
re_relation = fluid.layers.cos(phase_relation)
im_relation = fluid.layers.sin(phase_relation)
re_score = re_relation * re_tail + im_relation * im_tail
im_score = re_relation * im_tail - im_relation * re_tail
re_score = re_entity_embedding - re_score
im_score = im_entity_embedding - im_score
re_score = re_score * re_score
im_score = im_score * im_score
head_score = re_score + im_score
head_score += epsilon_var
head_score = fluid.layers.sqrt(head_score)
head_score = fluid.layers.reduce_sum(head_score, dim=-1)
re_score = re_head * re_relation - im_head * im_relation
im_score = re_head * im_relation + im_head * re_relation
re_score = re_entity_embedding - re_score
im_score = im_entity_embedding - im_score
re_score = re_score * re_score
im_score = im_score * im_score
tail_score = re_score + im_score
tail_score += epsilon_var
tail_score = fluid.layers.sqrt(tail_score)
tail_score = fluid.layers.reduce_sum(tail_score, dim=-1)
return head_score, tail_score
def construct_test_program(self):
"""
Construct test program
"""
epsilon_var = fluid.layers.fill_constant(
shape=[1], dtype='float32', value=1e-12)
entity_embedding, relation_embedding = self.creat_share_variables()
head_vec = lookup_table(self.test_input[0], entity_embedding)
rel_vec = lookup_table(self.test_input[1], relation_embedding)
tail_vec = lookup_table(self.test_input[2], entity_embedding)
head_vec = fluid.layers.unsqueeze(head_vec, axes=[0])
rel_vec = fluid.layers.unsqueeze(rel_vec, axes=[0])
tail_vec = fluid.layers.unsqueeze(tail_vec, axes=[0])
id_replace_head, id_replace_tail = self.score_with_l2_normalize_with_validate(
entity_embedding, head_vec, rel_vec, tail_vec, epsilon_var)
id_replace_head = fluid.layers.logsigmoid(id_replace_head)
id_replace_tail = fluid.layers.logsigmoid(id_replace_tail)
return [id_replace_head, id_replace_tail]
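As a side note, the score above is the RotatE rotation in complex space: entities are complex vectors, relations are element-wise unit rotations, and the score is margin minus the summed moduli of h∘r − t. The NumPy sketch below is illustrative only, with made-up values.

```
# Illustrative NumPy view of the RotatE score (not part of the model code).
import numpy as np


def rotate_score(head, rel_phase, tail, margin):
    """head/tail: complex vectors; rel_phase: real phases of a unit rotation."""
    rotation = np.exp(1j * rel_phase)      # |rotation| == 1 element-wise
    diff = head * rotation - tail          # rotate the head, compare to the tail
    return margin - np.abs(diff).sum()


dim = 4
rng = np.random.default_rng(0)
head = rng.normal(size=dim) + 1j * rng.normal(size=dim)
phase = rng.uniform(-np.pi, np.pi, size=dim)
tail = head * np.exp(1j * phase)           # a triple that fits the relation exactly
print(rotate_score(head, phase, tail, margin=8.0))   # ~8.0, i.e. distance ~0
```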
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
TransE:
"Translating embeddings for modeling multi-relational data."
Bordes, Antoine, et al.
https://www.utc.fr/~bordesan/dokuwiki/_media/en/transe_nips13.pdf
"""
import paddle.fluid as fluid
from .Model import Model
from .utils import lookup_table
class TransE(Model):
"""
The TransE Model.
"""
def __init__(self,
data_reader,
hidden_size,
margin,
learning_rate,
args,
optimizer="adam"):
self._neg_times = args.neg_times
super(TransE, self).__init__(
model_name="TransE",
data_reader=data_reader,
hidden_size=hidden_size,
margin=margin,
learning_rate=learning_rate,
args=args,
optimizer=optimizer)
self.construct()
def creat_share_variables(self):
"""
Share variables for train and test programs.
"""
entity_embedding = fluid.layers.create_parameter(
shape=self._ent_shape, dtype="float32", name=self.ent_name)
relation_embedding = fluid.layers.create_parameter(
shape=self._rel_shape, dtype="float32", name=self.rel_name)
return entity_embedding, relation_embedding
@staticmethod
def score_with_l2_normalize(head, rel, tail):
"""
Score function of TransE
"""
head = fluid.layers.l2_normalize(head, axis=-1)
rel = fluid.layers.l2_normalize(rel, axis=-1)
tail = fluid.layers.l2_normalize(tail, axis=-1)
score = head + rel - tail
return score
def construct_train_program(self):
"""
Construct train program.
"""
entity_embedding, relation_embedding = self.creat_share_variables()
pos_head = lookup_table(self.train_pos_input[:, 0], entity_embedding)
pos_tail = lookup_table(self.train_pos_input[:, 2], entity_embedding)
pos_rel = lookup_table(self.train_pos_input[:, 1], relation_embedding)
neg_head = lookup_table(self.train_neg_input[:, 0], entity_embedding)
neg_tail = lookup_table(self.train_neg_input[:, 2], entity_embedding)
neg_rel = lookup_table(self.train_neg_input[:, 1], relation_embedding)
pos_score = self.score_with_l2_normalize(pos_head, pos_rel, pos_tail)
neg_score = self.score_with_l2_normalize(neg_head, neg_rel, neg_tail)
pos = fluid.layers.reduce_sum(
fluid.layers.abs(pos_score), 1, keep_dim=False)
neg = fluid.layers.reduce_sum(
fluid.layers.abs(neg_score), 1, keep_dim=False)
neg = fluid.layers.reshape(
neg, shape=[-1, self._neg_times], inplace=True)
loss = fluid.layers.reduce_mean(
fluid.layers.relu(pos - neg + self._margin))
return [loss]
def construct_test_program(self):
"""
Construct test program
"""
entity_embedding, relation_embedding = self.creat_share_variables()
entity_embedding = fluid.layers.l2_normalize(entity_embedding, axis=-1)
relation_embedding = fluid.layers.l2_normalize(
relation_embedding, axis=-1)
head_vec = lookup_table(self.test_input[0], entity_embedding)
rel_vec = lookup_table(self.test_input[1], relation_embedding)
tail_vec = lookup_table(self.test_input[2], entity_embedding)
# The paddle fluid.layers.topk GPU OP is very inefficient, so we do the
# sort operation in the evaluation step using multiprocessing.
id_replace_head = fluid.layers.reduce_sum(
fluid.layers.abs(entity_embedding + rel_vec - tail_vec), dim=1)
id_replace_tail = fluid.layers.reduce_sum(
fluid.layers.abs(entity_embedding - rel_vec - head_vec), dim=1)
return [id_replace_head, id_replace_tail]
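For intuition, the objective above is the classic TransE margin ranking loss over L1 distances of L2-normalized embeddings. The NumPy sketch below spells it out; values are made up and purely illustrative.

```
# Illustrative NumPy view of the TransE score and loss (not part of the model code).
import numpy as np


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def transe_distance(h, r, t):
    """L1 distance ||h + r - t||_1 on L2-normalized embeddings."""
    h, r, t = l2_normalize(h), l2_normalize(r), l2_normalize(t)
    return np.abs(h + r - t).sum(axis=-1)


rng = np.random.default_rng(0)
h, r, t = rng.normal(size=(3, 8))           # one positive triple
t_neg = rng.normal(size=8)                  # a corrupted (negative) tail
pos = transe_distance(h, r, t)
neg = transe_distance(h, r, t_neg)
margin = 4.0
loss = np.maximum(pos - neg + margin, 0.0)  # margin ranking (hinge) loss
print(pos, neg, loss)
```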
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
TransR:
"Learning entity and relation embeddings for knowledge graph completion."
Lin, Yankai, et al.
https://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9571/9523
"""
import numpy as np
import paddle.fluid as fluid
from .Model import Model
from .utils import lookup_table
class TransR(Model):
"""
TransR model.
"""
def __init__(self,
data_reader,
hidden_size,
margin,
learning_rate,
args,
optimizer="adam"):
"""init"""
self._neg_times = args.neg_times
super(TransR, self).__init__(
model_name="TransR",
data_reader=data_reader,
hidden_size=hidden_size,
margin=margin,
learning_rate=learning_rate,
args=args,
optimizer=optimizer)
self.construct()
def creat_share_variables(self):
"""
Share variables for train and test programs.
"""
entity_embedding = fluid.layers.create_parameter(
shape=self._ent_shape,
dtype="float32",
name=self.ent_name,
default_initializer=fluid.initializer.Xavier())
relation_embedding = fluid.layers.create_parameter(
shape=self._rel_shape,
dtype="float32",
name=self.rel_name,
default_initializer=fluid.initializer.Xavier())
init_values = np.tile(
np.identity(
self._hidden_size, dtype="float32").reshape(-1),
(self._relation_total, 1))
transfer_matrix = fluid.layers.create_parameter(
shape=[
self._relation_total, self._hidden_size * self._hidden_size
],
dtype="float32",
name=self._prefix + "transfer_matrix",
default_initializer=fluid.initializer.NumpyArrayInitializer(
init_values))
return entity_embedding, relation_embedding, transfer_matrix
def score_with_l2_normalize(self, head, rel, tail):
"""
Score function of TransR
"""
head = fluid.layers.l2_normalize(head, axis=-1)
rel = fluid.layers.l2_normalize(rel, axis=-1)
tail = fluid.layers.l2_normalize(tail, axis=-1)
score = head + rel - tail
return score
@staticmethod
def matmul_with_expend_dims(x, y):
"""matmul_with_expend_dims"""
x = fluid.layers.unsqueeze(x, axes=[1])
res = fluid.layers.matmul(x, y)
return fluid.layers.squeeze(res, axes=[1])
def construct_train_program(self):
"""
Construct train program
"""
entity_embedding, relation_embedding, transfer_matrix = self.creat_share_variables(
)
pos_head = lookup_table(self.train_pos_input[:, 0], entity_embedding)
pos_tail = lookup_table(self.train_pos_input[:, 2], entity_embedding)
pos_rel = lookup_table(self.train_pos_input[:, 1], relation_embedding)
neg_head = lookup_table(self.train_neg_input[:, 0], entity_embedding)
neg_tail = lookup_table(self.train_neg_input[:, 2], entity_embedding)
neg_rel = lookup_table(self.train_neg_input[:, 1], relation_embedding)
rel_matrix = fluid.layers.reshape(
lookup_table(self.train_pos_input[:, 1], transfer_matrix),
[-1, self._hidden_size, self._hidden_size])
pos_head_trans = self.matmul_with_expend_dims(pos_head, rel_matrix)
pos_tail_trans = self.matmul_with_expend_dims(pos_tail, rel_matrix)
trans_neg = True
if trans_neg:
rel_matrix_neg = fluid.layers.reshape(
lookup_table(self.train_neg_input[:, 1], transfer_matrix),
[-1, self._hidden_size, self._hidden_size])
neg_head_trans = self.matmul_with_expend_dims(neg_head,
rel_matrix_neg)
neg_tail_trans = self.matmul_with_expend_dims(neg_tail,
rel_matrix_neg)
else:
neg_head_trans = self.matmul_with_expend_dims(neg_head, rel_matrix)
neg_tail_trans = self.matmul_with_expend_dims(neg_tail, rel_matrix)
pos_score = self.score_with_l2_normalize(pos_head_trans, pos_rel,
pos_tail_trans)
neg_score = self.score_with_l2_normalize(neg_head_trans, neg_rel,
neg_tail_trans)
pos = fluid.layers.reduce_sum(
fluid.layers.abs(pos_score), -1, keep_dim=False)
neg = fluid.layers.reduce_sum(
fluid.layers.abs(neg_score), -1, keep_dim=False)
neg = fluid.layers.reshape(
neg, shape=[-1, self._neg_times], inplace=True)
loss = fluid.layers.reduce_mean(
fluid.layers.relu(pos - neg + self._margin))
return [loss]
def construct_test_program(self):
"""
Construct test program
"""
entity_embedding, relation_embedding, transfer_matrix = self.creat_share_variables(
)
rel_matrix = fluid.layers.reshape(
lookup_table(self.test_input[1], transfer_matrix),
[self._hidden_size, self._hidden_size])
entity_embedding_trans = fluid.layers.matmul(entity_embedding,
rel_matrix, False, False)
rel_vec = lookup_table(self.test_input[1], relation_embedding)
entity_embedding_trans = fluid.layers.l2_normalize(
entity_embedding_trans, axis=-1)
rel_vec = fluid.layers.l2_normalize(rel_vec, axis=-1)
head_vec = lookup_table(self.test_input[0], entity_embedding_trans)
tail_vec = lookup_table(self.test_input[2], entity_embedding_trans)
# The paddle fluid.layers.topk GPU OP is very inefficient, so we do the
# sort operation in the evaluation step using multiprocessing.
id_replace_head = fluid.layers.reduce_sum(
fluid.layers.abs(entity_embedding_trans + rel_vec - tail_vec),
dim=1)
id_replace_tail = fluid.layers.reduce_sum(
fluid.layers.abs(entity_embedding_trans - rel_vec - head_vec),
dim=1)
return [id_replace_head, id_replace_tail]
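For intuition, the NumPy sketch below (illustrative, made-up values) shows the TransR idea implemented above: project head and tail through a relation-specific matrix, which starts close to the identity as in `creat_share_variables`, then score with the usual translational distance.

```
# Illustrative NumPy view of the TransR projection (not part of the model code).
import numpy as np

dim = 4
rng = np.random.default_rng(0)
# The transfer matrix starts near the identity, mirroring the initializer above.
M_r = np.identity(dim, dtype="float32") + \
    0.01 * rng.normal(size=(dim, dim)).astype("float32")
h, r, t = rng.normal(size=(3, dim)).astype("float32")

h_r = h @ M_r                               # project head into the relation space
t_r = t @ M_r                               # project tail into the relation space
score = np.abs(h_r + r - t_r).sum()         # translational distance in that space
print(score)
```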
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""import all models"""
from .TransE import TransE
from .TransR import TransR
from .RotatE import RotatE
model_dict = {
"TransE": TransE,
"transe": TransE,
"TransR": TransR,
"transr": TransR,
"RotatE": RotatE
}
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Utils for the models.
"""
import paddle.fluid as fluid
from paddle.fluid.layer_helper import LayerHelper
def lookup_table(input, embedding_table, dtype='float32'):
"""
lookup table support for paddle.
:param input: int64 ids to look up.
:param embedding_table: the embedding parameter to look up from.
:param dtype: data type of the output.
:return: the looked-up embeddings.
"""
is_sparse = False
is_distributed = False
helper = LayerHelper('embedding', **locals())
remote_prefetch = is_sparse and (not is_distributed)
if remote_prefetch:
assert is_sparse is True and is_distributed is False
tmp = helper.create_variable_for_type_inference(dtype)
padding_idx = -1
helper.append_op(
type='lookup_table',
inputs={'Ids': input,
'W': embedding_table},
outputs={'Out': tmp},
attrs={
'is_sparse': is_sparse,
'is_distributed': is_distributed,
'remote_prefetch': remote_prefetch,
'padding_idx': padding_idx
})
return tmp
def lookup_table_gather(index, input):
"""
lookup table support for paddle by gather.
:param index: the indices to gather.
:param input: the tensor to gather from.
:return: the gathered rows.
"""
return fluid.layers.gather(index=index, input=input, overwrite=False)
def _clone_var_in_block_(block, var):
assert isinstance(var, fluid.Variable)
if var.desc.type() == fluid.core.VarDesc.VarType.LOD_TENSOR:
return block.create_var(
name=var.name,
shape=var.shape,
dtype=var.dtype,
type=var.type,
lod_level=var.lod_level,
persistable=True)
else:
return block.create_var(
name=var.name,
shape=var.shape,
dtype=var.dtype,
type=var.type,
persistable=True)
def load_var(executor, main_program=None, var=None, filename=None):
"""
Load a single variable from file into the given program.
:param executor: the executor.
:param main_program: the program that owns the variable.
:param var: the variable name in main_program.
:param filename: the file to load the variable from.
:return: None
"""
load_prog = fluid.Program()
load_block = load_prog.global_block()
if main_program is None:
main_program = fluid.default_main_program()
if not isinstance(main_program, fluid.Program):
raise TypeError("program should be as Program type or None")
vars = list(filter(None, main_program.list_vars()))
# save origin param shape
orig_para_shape = {}
load_var_map = {}
for each_var in vars:
if each_var.name != var:
continue
assert isinstance(each_var, fluid.Variable)
if each_var.type == fluid.core.VarDesc.VarType.RAW:
continue
if isinstance(each_var, fluid.framework.Parameter):
orig_para_shape[each_var.name] = tuple(each_var.desc.get_shape())
new_var = _clone_var_in_block_(load_block, each_var)
if filename is not None:
load_block.append_op(
type='load',
inputs={},
outputs={'Out': [new_var]},
attrs={'file_path': filename})
executor.run(load_prog)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file uses multiprocessing to perform the following process:
`
for data in reader():
yield func(data)
`
"""
#encoding=utf8
import numpy as np
import multiprocessing as mp
import traceback
from pgl.utils.logger import log
def mp_reader_mapper(reader, func, num_works=4):
"""
This function uses multiprocessing to perform the following process:
`
for data in reader():
yield func(data)
`
The input stream is `reader`; the mapper applies `func` to map it to an output stream.
Please ensure that `func` returns a meaningful value, not `None`!
:param reader: the data iterator.
:param func: the mapping function.
:param num_works: number of worker processes.
:return: a new iterator.
"""
def _read_into_pipe(func, conn):
"""
read into pipe, and use the `func` to get final data.
"""
while True:
data = conn.recv()
if data is None:
conn.send(None)
conn.close()
break
conn.send(func(data))
def pipe_reader():
"""pipe_reader"""
conns = []
all_process = []
for w in range(num_works):
parent_conn, child_conn = mp.Pipe()
conns.append(parent_conn)
p = mp.Process(target=_read_into_pipe, args=(func, child_conn))
p.start()
all_process.append(p)
data_iter = reader()
if not hasattr(data_iter, "__next__"):
__next__ = data_iter.next
else:
__next__ = data_iter.__next__
def next_data():
"""next_data"""
_next = None
try:
_next = __next__()
except StopIteration:
# log.debug(traceback.format_exc())
pass
except Exception:
log.debug(traceback.format_exc())
return _next
for i in range(num_works):
conns[i].send(next_data())
finish_num = 0
finish_flag = np.zeros(len(conns), dtype="int32")
while finish_num < num_works:
for conn_id, conn in enumerate(conns):
if finish_flag[conn_id] > 0:
continue
sample = conn.recv()
if sample is None:
finish_num += 1
conn.close()
finish_flag[conn_id] = 1
else:
yield sample
conns[conn_id].send(next_data())
return pipe_reader
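A possible usage of this helper is sketched below with a toy reader and map function; it assumes a fork-based multiprocessing start method (e.g. Linux) and is illustrative only.

```
# Usage sketch for mp_reader_mapper (not part of the module).
if __name__ == "__main__":

    def number_reader():
        """A toy input stream: yields 0..9."""
        for i in range(10):
            yield i

    def square(x):
        """The map function; must return something other than None."""
        return x * x

    mapped = mp_reader_mapper(number_reader, func=square, num_works=2)
    # `mapped` is itself a generator factory, just like `reader`.
    print(sorted(mapped()))
```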
device=3
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransE \
--data_dir ./data/FB15k \
--optimizer adam \
--batch_size=1024 \
--learning_rate=0.001 \
--epoch 200 \
--evaluate_per_iteration 200 \
--sample_workers 1 \
--margin 1.0 \
--nofilter True \
--neg_times 10 \
--neg_mode True
#--only_evaluate
# TransE FB15k
# -----Raw-Average-Results
# MeanRank: 214.94, MRR: 0.2051, Hits@1: 0.0929, Hits@3: 0.2343, Hits@10: 0.4458
# -----Filter-Average-Results
# MeanRank: 74.41, MRR: 0.3793, Hits@1: 0.2351, Hits@3: 0.4538, Hits@10: 0.6570
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransE \
--data_dir ./data/WN18 \
--optimizer adam \
--batch_size=1024 \
--learning_rate=0.001 \
--epoch 100 \
--evaluate_per_iteration 100 \
--sample_workers 1 \
--margin 4 \
--nofilter True \
--neg_times 10 \
--neg_mode True
# TransE WN18
# -----Raw-Average-Results
# MeanRank: 219.08, MRR: 0.3383, Hits@1: 0.0821, Hits@3: 0.5233, Hits@10: 0.7997
# -----Filter-Average-Results
# MeanRank: 207.72, MRR: 0.4631, Hits@1: 0.1349, Hits@3: 0.7708, Hits@10: 0.9315
# for pretrain
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransE \
--data_dir ./data/FB15k \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 30 \
--evaluate_per_iteration 30 \
--sample_workers 1 \
--margin 2.0 \
--nofilter True \
--noeval True \
--neg_times 10 \
--neg_mode True && \
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransR \
--data_dir ./data/FB15k \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 200 \
--evaluate_per_iteration 200 \
--sample_workers 1 \
--margin 2.0 \
--pretrain True \
--nofilter True \
--neg_times 10 \
--neg_mode True
# FB15k TransR 200, pretrain 20
# -----Raw-Average-Results
# MeanRank: 303.81, MRR: 0.1931, Hits@1: 0.0920, Hits@3: 0.2109, Hits@10: 0.4181
# -----Filter-Average-Results
# MeanRank: 156.30, MRR: 0.3663, Hits@1: 0.2318, Hits@3: 0.4352, Hits@10: 0.6231
# for pretrain
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransE \
--data_dir ./data/WN18 \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 30 \
--evaluate_per_iteration 30 \
--sample_workers 1 \
--margin 4.0 \
--nofilter True \
--noeval True \
--neg_times 10 \
--neg_mode True && \
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model TransR \
--data_dir ./data/WN18 \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 100 \
--evaluate_per_iteration 100 \
--sample_workers 1 \
--margin 4.0 \
--pretrain True \
--nofilter True \
--neg_times 10 \
--neg_mode True
# TransR WN18 100, pretrain 30
# -----Raw-Average-Results
# MeanRank: 321.41, MRR: 0.3706, Hits@1: 0.0955, Hits@3: 0.5906, Hits@10: 0.8099
# -----Filter-Average-Results
# MeanRank: 309.15, MRR: 0.5126, Hits@1: 0.1584, Hits@3: 0.8601, Hits@10: 0.9409
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model RotatE \
--data_dir ./data/FB15k \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 100 \
--evaluate_per_iteration 100 \
--sample_workers 10 \
--margin 8 \
--neg_times 10 \
--neg_mode True
# RotatE FB15k
# -----Raw-Average-Results
# MeanRank: 156.85, MRR: 0.2699, Hits@1: 0.1615, Hits@3: 0.3031, Hits@10: 0.5006
# -----Filter-Average-Results
# MeanRank: 53.35, MRR: 0.4776, Hits@1: 0.3537, Hits@3: 0.5473, Hits@10: 0.7062
CUDA_VISIBLE_DEVICES=$device \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python main.py \
--use_cuda \
--model RotatE \
--data_dir ./data/WN18 \
--optimizer adam \
--batch_size=512 \
--learning_rate=0.001 \
--epoch 100 \
--evaluate_per_iteration 100 \
--sample_workers 10 \
--margin 6 \
--neg_times 10 \
--neg_mode True
# RotatE WN18
# -----Raw-Average-Results
# MeanRank: 167.27, MRR: 0.6025, Hits@1: 0.4764, Hits@3: 0.6880, Hits@10: 0.8298
# -----Filter-Average-Results
# MeanRank: 155.23, MRR: 0.9145, Hits@1: 0.8843, Hits@3: 0.9412, Hits@10: 0.9570
# SGC: Simplifying Graph Convolutional Networks
[Simplifying Graph Convolutional Networks \(SGC\)](https://arxiv.org/pdf/1902.07153.pdf) is a simplified graph convolutional model for machine learning on graphs. Based on PGL, we reproduce the SGC algorithm and match the accuracy reported in the paper on citation network benchmarks.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle 1.5
- pgl
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | ---|
| Cora | 0.818 (paper: 0.810) | 0.0015s |
| Pubmed | 0.788 (paper: 0.789) | 0.0015s |
| Citeseer | 0.719 (paper: 0.719) | 0.0015s |
### How to run
For example, to train SGC on the Cora dataset with GPU:
```
python sgc.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use GPU if `--use_cuda` is specified.
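To make the "precompute, then fit a linear model" idea concrete, the following NumPy sketch (illustrative only, toy graph) shows what the `MessagePassing` step in the training script that follows effectively computes: K rounds of symmetrically normalized propagation, after which a single fully-connected layer is trained on the smoothed features.

```
# Minimal sketch of the SGC feature precomputation: X' = (D^-1/2 A D^-1/2)^K X.
import numpy as np

A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype="float32")   # toy adjacency matrix
X = np.eye(3, dtype="float32")               # toy node features

deg = A.sum(axis=1)
d_inv_sqrt = np.zeros_like(deg)
d_inv_sqrt[deg > 0] = deg[deg > 0] ** -0.5   # D^{-1/2}, guarding isolated nodes
A_hat = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

K = 2
X_smooth = X
for _ in range(K):                           # K rounds of propagation
    X_smooth = A_hat @ X_smooth

print(X_smooth)  # these smoothed features feed a single linear (fc) classifier
```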
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implement the training process of SGC model with StaticGraphWrapper.
"""
import os
import argparse
import numpy as np
import random
import time
import pgl
from pgl import data_loader
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle.fluid as fluid
def load(name):
"""Load dataset."""
if name == 'cora':
dataset = data_loader.CoraDataset()
elif name == "pubmed":
dataset = data_loader.CitationDataset("pubmed", symmetry_edges=False)
elif name == "citeseer":
dataset = data_loader.CitationDataset("citeseer", symmetry_edges=False)
else:
raise ValueError(name + " dataset doesn't exists")
return dataset
def expand_data_dim(dataset):
"""Expand the dimension of data."""
train_index = dataset.train_index
train_label = np.expand_dims(dataset.y[train_index], -1)
train_index = np.expand_dims(train_index, -1)
val_index = dataset.val_index
val_label = np.expand_dims(dataset.y[val_index], -1)
val_index = np.expand_dims(val_index, -1)
test_index = dataset.test_index
test_label = np.expand_dims(dataset.y[test_index], -1)
test_index = np.expand_dims(test_index, -1)
return {
'train_index': train_index,
'train_label': train_label,
'val_index': val_index,
'val_label': val_label,
'test_index': test_index,
'test_label': test_label,
}
def MessagePassing(gw, feature, num_layers, norm=None):
"""Precomputing message passing.
"""
def send_src_copy(src_feat, dst_feat, edge_feat):
"""send_src_copy
"""
return src_feat["h"]
for _ in range(num_layers):
if norm is not None:
feature = feature * norm
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
feature = gw.recv(msg, "sum")
if norm is not None:
feature = feature * norm
return feature
def pre_gather(features, name_prefix, node_index_val):
"""Get features with respect to node index.
"""
node_index, init = paddle_helper.constant(
"%s_node_index" % (name_prefix), dtype='int32', value=node_index_val)
logits = fluid.layers.gather(features, node_index)
return logits, init
def calculate_loss(name, np_cached_h, node_label_val, num_classes, args):
"""Calculate loss function.
"""
initializer = []
const_cached_h, init = paddle_helper.constant(
"const_%s_cached_h" % name, dtype='float32', value=np_cached_h)
initializer.append(init)
node_label, init = paddle_helper.constant(
"%s_node_label" % (name), dtype='int64', value=node_label_val)
initializer.append(init)
output = fluid.layers.fc(const_cached_h,
size=num_classes,
bias_attr=args.bias,
name='fc')
loss, probs = fluid.layers.softmax_with_cross_entropy(
logits=output, label=node_label, return_softmax=True)
loss = fluid.layers.mean(loss)
acc = None
if name != 'train':
acc = fluid.layers.accuracy(input=probs, label=node_label, k=1)
return {
'loss': loss,
'acc': acc,
'probs': probs,
'initializer': initializer
}
def main(args):
""""Main function."""
dataset = load(args.dataset)
# normalize
indegree = dataset.graph.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
dataset.graph.node_feat["norm"] = np.expand_dims(norm, -1)
data = expand_data_dim(dataset)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
precompute_program = fluid.Program()
startup_program = fluid.Program()
train_program = fluid.Program()
val_program = train_program.clone(for_test=True)
test_program = train_program.clone(for_test=True)
# precompute message passing and gather
initializer = []
with fluid.program_guard(precompute_program, startup_program):
gw = pgl.graph_wrapper.StaticGraphWrapper(
name="graph", place=place, graph=dataset.graph)
cached_h = MessagePassing(
gw,
gw.node_feat["words"],
num_layers=args.num_layers,
norm=gw.node_feat['norm'])
train_cached_h, init = pre_gather(cached_h, 'train',
data['train_index'])
initializer.append(init)
val_cached_h, init = pre_gather(cached_h, 'val', data['val_index'])
initializer.append(init)
test_cached_h, init = pre_gather(cached_h, 'test', data['test_index'])
initializer.append(init)
exe = fluid.Executor(place)
gw.initialize(place)
for init in initializer:
init(place)
# get train features, val features and test features
np_train_cached_h, np_val_cached_h, np_test_cached_h = exe.run(
precompute_program,
feed={},
fetch_list=[train_cached_h, val_cached_h, test_cached_h],
return_numpy=True)
initializer = []
with fluid.program_guard(train_program, startup_program):
with fluid.unique_name.guard():
train_handle = calculate_loss('train', np_train_cached_h,
data['train_label'],
dataset.num_classes, args)
initializer += train_handle['initializer']
adam = fluid.optimizer.Adam(
learning_rate=args.lr,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=args.weight_decay))
adam.minimize(train_handle['loss'])
with fluid.program_guard(val_program, startup_program):
with fluid.unique_name.guard():
val_handle = calculate_loss('val', np_val_cached_h,
data['val_label'], dataset.num_classes,
args)
initializer += val_handle['initializer']
with fluid.program_guard(test_program, startup_program):
with fluid.unique_name.guard():
test_handle = calculate_loss('test', np_test_cached_h,
data['test_label'],
dataset.num_classes, args)
initializer += test_handle['initializer']
exe.run(startup_program)
for init in initializer:
init(place)
dur = []
for epoch in range(args.epochs):
if epoch >= 3:
t0 = time.time()
train_loss_t = exe.run(train_program,
feed={},
fetch_list=[train_handle['loss']],
return_numpy=True)[0]
if epoch >= 3:
time_per_epoch = 1.0 * (time.time() - t0)
dur.append(time_per_epoch)
val_loss_t, val_acc_t = exe.run(
val_program,
feed={},
fetch_list=[val_handle['loss'], val_handle['acc']],
return_numpy=True)
log.info("Epoch %d " % epoch + "(%.5lf sec) " % np.mean(
dur) + "Train Loss: %f " % train_loss_t + "Val Loss: %f " %
val_loss_t + "Val Acc: %f " % val_acc_t)
test_loss_t, test_acc_t = exe.run(
test_program,
feed={},
fetch_list=[test_handle['loss'], test_handle['acc']],
return_numpy=True)
log.info("Test Accuracy: %f" % test_acc_t)
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='SGC')
parser.add_argument(
"--dataset",
type=str,
default="cora",
help="dataset (cora, pubmed, citeseer)")
parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
parser.add_argument(
"--seed", type=int, default=1667, help="global random seed")
parser.add_argument("--lr", type=float, default=0.2, help="learning rate")
parser.add_argument(
"--weight_decay",
type=float,
default=0.000005,
help="Weight for L2 loss")
parser.add_argument(
"--bias", action='store_true', default=False, help="flag to use bias")
parser.add_argument(
"--epochs", type=int, default=200, help="number of training epochs")
parser.add_argument(
"--num_layers", type=int, default=2, help="number of SGC layers")
args = parser.parse_args()
log.info(args)
main(args)
......@@ -11,7 +11,7 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
......@@ -19,11 +19,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gat | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~83% | 0.0145s | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0352s | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0148s | 0.0124s |0.0253s | 2.04x |
| Dataset | Accuracy | epoch time | examples/gat | Improvement |
| --- | --- | --- | --- | --- |
| Cora | ~83% | 0.0119s | 0.0175s | 1.47x |
| Pubmed | ~78% | 0.0193s |0.0295s | 1.53x |
| Citeseer | ~70% | 0.0124s |0.0253s | 2.04x |
### How to run
......
......@@ -84,7 +84,7 @@ def main(args):
initializer = []
with fluid.program_guard(train_program, startup_program):
train_node_index, init = paddle_helper.constant(
"train_node_index", dtype="int32", value=train_index)
"train_node_index", dtype="int64", value=train_index)
initializer.append(init)
train_node_label, init = paddle_helper.constant(
......@@ -103,7 +103,7 @@ def main(args):
with fluid.program_guard(val_program, startup_program):
val_node_index, init = paddle_helper.constant(
"val_node_index", dtype="int32", value=val_index)
"val_node_index", dtype="int64", value=val_index)
initializer.append(init)
val_node_label, init = paddle_helper.constant(
......@@ -119,7 +119,7 @@ def main(args):
with fluid.program_guard(test_program, startup_program):
test_node_index, init = paddle_helper.constant(
"test_node_index", dtype="int32", value=test_index)
"test_node_index", dtype="int64", value=test_index)
initializer.append(init)
test_node_label, init = paddle_helper.constant(
......
......@@ -10,7 +10,7 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
### Dependencies
- paddlepaddle>=1.4 (The speed can be faster in 1.5.)
- paddlepaddle>=1.6
- pgl
### Performance
......@@ -18,12 +18,11 @@ The datasets contain three citation networks: CORA, PUBMED, CITESEER. The detail
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.4 <br> (epoch time) | Speed with paddle 1.5 <br> (epoch time)| examples/gcn | Improvement |
| --- | --- | --- |---| --- | --- |
| Cora | ~81% | 0.0053s | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0105s | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0051s | 0.0045s |0.0177s | 3.93x |
| Dataset | Accuracy | epoch time | examples/gcn | Improvement |
| --- | --- | --- | --- | --- |
| Cora | ~81% | 0.0047s | 0.0104s | 2.21x |
| Pubmed | ~79% | 0.0049s |0.0154s | 3.14x |
| Citeseer | ~71% | 0.0045s |0.0177s | 3.93x |
### How to run
......
......@@ -85,7 +85,7 @@ def main(args):
initializer = []
with fluid.program_guard(train_program, startup_program):
train_node_index, init = paddle_helper.constant(
"train_node_index", dtype="int32", value=train_index)
"train_node_index", dtype="int64", value=train_index)
initializer.append(init)
train_node_label, init = paddle_helper.constant(
......@@ -104,7 +104,7 @@ def main(args):
with fluid.program_guard(val_program, startup_program):
val_node_index, init = paddle_helper.constant(
"val_node_index", dtype="int32", value=val_index)
"val_node_index", dtype="int64", value=val_index)
initializer.append(init)
val_node_label, init = paddle_helper.constant(
......@@ -120,7 +120,7 @@ def main(args):
with fluid.program_guard(test_program, startup_program):
test_node_index, init = paddle_helper.constant(
"test_node_index", dtype="int32", value=test_index)
"test_node_index", dtype="int64", value=test_index)
initializer.append(init)
test_node_label, init = paddle_helper.constant(
......
# STGCN: Spatio-Temporal Graph Convolutional Network
[Spatio-Temporal Graph Convolutional Network \(STGCN\)](https://arxiv.org/pdf/1709.04875.pdf) is a deep learning framework for time-series prediction on graphs. Based on PGL, we reproduce the STGCN algorithm to predict newly confirmed patients in a set of cities from historical migration records.
### Datasets
You can build your own dataset in the following format:
* input.csv: Historical migration records with shape of [num\_time\_steps * num\_cities].
* output.csv: Newly confirmed patient records with shape of [num\_time\_steps * num\_cities].
* W.csv: Weighted adjacency matrix with shape of [num\_cities * num\_cities].
* city.csv: Each line contains a city index and the corresponding city name.
### Dependencies
- paddlepaddle 1.6
- pgl 1.0.0
### How to run
For example, to train STGCN on your dataset with GPU:
```
python main.py --use_cuda --input_file dataset/input.csv --label_file dataset/output.csv --adj_mat_file dataset/W.csv --city_file dataset/city.csv
```
#### Hyperparameters
- n\_route: Number of cities.
- n\_his: Number of previous time steps of historical migration records used as model input.
- n\_pred: Number of future time steps of newly confirmed patient records to predict.
- Ks: Number of GCN layers.
- Kt: Kernel size of the temporal convolution.
- use\_cuda: Use GPU if `--use_cuda` is specified.
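For intuition, the following NumPy sketch (illustrative only, made-up values, simplified relative to `data_gen_mydata` further down) shows how the input and label series are stacked into training windows of shape [n\_his + n\_pred, n\_route, 1].

```
# Simplified sketch of the sliding-window construction used by the data loader.
import numpy as np

n_route, n_his, n_pred = 4, 5, 1
T = 20                                               # number of time steps
x = np.arange(T * n_route, dtype="float32").reshape(T, n_route)  # stands in for input.csv
y = x + 100.0                                                     # stands in for output.csv

windows = []
for i in range(T - n_his - n_pred + 1):
    frame = np.concatenate([x[i:i + n_his], y[i:i + n_pred]])     # n_his inputs + n_pred labels
    windows.append(frame.reshape(n_his + n_pred, n_route, 1))
data = np.stack(windows)
print(data.shape)                 # (num_samples, n_his + n_pred, n_route, 1)
```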
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""__init__"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""data processing
"""
import numpy as np
import pandas as pd
from utils.math_utils import z_score
class Dataset(object):
"""Dataset
"""
def __init__(self, data, stats):
self.__data = data
self.mean = stats['mean']
self.std = stats['std']
def get_data(self, type): # type: train, val or test
return self.__data[type]
def get_stats(self):
return {'mean': self.mean, 'std': self.std}
def get_len(self, type):
return len(self.__data[type])
def z_inverse(self, type):
return self.__data[type] * self.std + self.mean
def seq_gen(len_seq, data_seq, offset, n_frame, n_route, day_slot, C_0=1):
"""Generate data in the form of standard sequence unit."""
n_slot = day_slot - n_frame + 1
tmp_seq = np.zeros((len_seq * n_slot, n_frame, n_route, C_0))
for i in range(len_seq):
for j in range(n_slot):
sta = (i + offset) * day_slot + j
end = sta + n_frame
tmp_seq[i * n_slot + j, :, :, :] = np.reshape(
data_seq[sta:end, :], [n_frame, n_route, C_0])
return tmp_seq
def adj_matrx_gen_custom(input_file, city_file):
"""genenrate Adjacency Matrix from file
"""
print("generate adj_matrix data (take long time)...")
# data
df = pd.read_csv(
input_file,
sep='\t',
names=['date', '迁出省份', '迁出城市', '迁入省份', '迁入城市', '人数'])
# Keep only the data from 2020
df['date'] = pd.to_datetime(df['date'], format="%Y%m%d")
df = df.set_index('date')
df = df['2020']
city_df = pd.read_csv(city_file)
# Drop Wuhan (the first row)
city_df = city_df.drop(0)
num = len(city_df)
matrix = np.zeros([num, num])
for i in city_df['city']:
for j in city_df['city']:
if (i == j):
continue
# Select the daily number of people migrating from city i to city j
cut = df[df['迁出城市'].str.contains(i)]
cut = cut[cut['迁入城市'].str.contains(j)]
# Use the mean as the edge weight
average = cut['人数'].mean()
# Assign the weight to the adjacency matrix
i_index = int(city_df[city_df['city'] == i]['num']) - 1
j_index = int(city_df[city_df['city'] == j]['num']) - 1
matrix[i_index, j_index] = average
np.savetxt("dataset/W_74.csv", matrix, delimiter=",")
def data_gen_custom(input_file, output_file, city_file, n, n_his, n_pred,
n_config):
"""data_gen_custom"""
print("generate training data...")
# data
df = pd.read_csv(
input_file,
sep='\t',
names=['date', '迁出省份', '迁出城市', '迁入省份', '迁入城市', '人数'])
# Keep only the data from 2020
df['date'] = pd.to_datetime(df['date'], format="%Y%m%d")
df = df.set_index('date')
df = df['2020']
city_df = pd.read_csv(city_file)
input_df = pd.DataFrame()
out_df_wuhan = df[df['迁出城市'].str.contains('武汉')]
for i in city_df['city']:
# Filter the records whose destination city is i
in_df_i = out_df_wuhan[out_df_wuhan['迁入城市'].str.contains(i)]
# Ensure ascending time order
# in_df_i.sort_values("date",inplace=True)
# Insert the column in time order
in_df_i.reset_index(drop=True, inplace=True)
input_df[i] = in_df_i['人数']
# Replace NaN values with 0
input_df = input_df.replace(np.nan, 0)
x = input_df
y = pd.read_csv(output_file)
# Drop the unnamed first column
x.drop(
x.columns[x.columns.str.contains(
'unnamed', case=False)],
axis=1,
inplace=True)
y = y.drop(columns=['date'])
# Drop the columns of migration into Wuhan
x = x.drop(columns=['武汉'])
y = y.drop(columns=['武汉'])
# param
n_val, n_test = n_config
n_train = len(y) - n_val - n_test - 2
# (?,26,74,1)
df = pd.DataFrame(columns=x.columns)
for i in range(len(y) - n_pred + 1):
df = df.append(x[i:i + n_his])
df = df.append(y[i:i + n_pred])
data = df.values.reshape(-1, n_his + n_pred, n,
1) # n == num_nodes == city num
x_stats = {'mean': np.mean(data), 'std': np.std(data)}
x_train = data[:n_train]
x_val = data[n_train:n_train + n_val]
x_test = data[n_train + n_val:]
x_data = {'train': x_train, 'val': x_val, 'test': x_test}
dataset = Dataset(x_data, x_stats)
print("generate successfully!")
return dataset
def data_gen_mydata(input_file, label_file, n, n_his, n_pred, n_config):
"""data processing
"""
# data
x = pd.read_csv(input_file)
y = pd.read_csv(label_file)
x = x.drop(columns=['date'])
y = y.drop(columns=['date'])
x = x.drop(columns=['武汉'])
y = y.drop(columns=['武汉'])
# param
n_val, n_test = n_config
n_train = len(y) - n_val - n_test - 2
# (?,26,74,1)
df = pd.DataFrame(columns=x.columns)
for i in range(len(y) - n_pred + 1):
df = df.append(x[i:i + n_his])
df = df.append(y[i:i + n_pred])
data = df.values.reshape(-1, n_his + n_pred, n, 1)
x_stats = {'mean': np.mean(data), 'std': np.std(data)}
x_train = data[:n_train]
x_val = data[n_train:n_train + n_val]
x_test = data[n_train + n_val:]
x_data = {'train': x_train, 'val': x_val, 'test': x_test}
dataset = Dataset(x_data, x_stats)
return dataset
def data_gen(file_path, data_config, n_route, n_frame=21, day_slot=288):
"""Source file load and dataset generation."""
n_train, n_val, n_test = data_config
# generate training, validation and test data
try:
data_seq = pd.read_csv(file_path, header=None).values
except FileNotFoundError:
print(f'ERROR: input file was not found in {file_path}.')
raise
seq_train = seq_gen(n_train, data_seq, 0, n_frame, n_route, day_slot)
seq_val = seq_gen(n_val, data_seq, n_train, n_frame, n_route, day_slot)
seq_test = seq_gen(n_test, data_seq, n_train + n_val, n_frame, n_route,
day_slot)
# x_stats: dict, the stats for the train dataset, including the value of mean and standard deviation.
x_stats = {'mean': np.mean(seq_train), 'std': np.std(seq_train)}
# x_train, x_val, x_test: np.array, [sample_size, n_frame, n_route, channel_size].
x_train = z_score(seq_train, x_stats['mean'], x_stats['std'])
x_val = z_score(seq_val, x_stats['mean'], x_stats['std'])
x_test = z_score(seq_test, x_stats['mean'], x_stats['std'])
x_data = {'train': x_train, 'val': x_val, 'test': x_test}
dataset = Dataset(x_data, x_stats)
return dataset
def gen_batch(inputs, batch_size, dynamic_batch=False, shuffle=False):
"""Data iterator in batch.
Args:
inputs: np.ndarray, [len_seq, n_frame, n_route, C_0], standard sequence units.
batch_size: int, size of batch.
dynamic_batch: bool, whether to shrink the last batch
if its length is less than the default batch size.
shuffle: bool, whether to shuffle the batches.
"""
len_inputs = len(inputs)
if shuffle:
idx = np.arange(len_inputs)
np.random.shuffle(idx)
for start_idx in range(0, len_inputs, batch_size):
end_idx = start_idx + batch_size
if end_idx > len_inputs:
if dynamic_batch:
end_idx = len_inputs
else:
break
if shuffle:
slide = idx[start_idx:end_idx]
else:
slide = slice(start_idx, end_idx)
yield inputs[slide]
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PGL Graph
"""
import sys
import os
import numpy as np
import pandas as pd
from pgl.graph import Graph
def weight_matrix(file_path, sigma2=0.1, epsilon=0.5, scaling=True):
"""Load weight matrix function."""
try:
W = pd.read_csv(file_path, header=None).values
except FileNotFoundError:
print(f'ERROR: input file was not found in {file_path}.')
raise
# check whether W is a 0/1 matrix.
if set(np.unique(W)) == {0, 1}:
print('The input graph is a 0/1 matrix; set "scaling" to False.')
scaling = False
if scaling:
n = W.shape[0]
W = W / 10000.
W2, W_mask = W * W, np.ones([n, n]) - np.identity(n)
# refer to Eq.10
return np.exp(-W2 / sigma2) * (
np.exp(-W2 / sigma2) >= epsilon) * W_mask
else:
return W
class GraphFactory(object):
"""GraphFactory"""
def __init__(self, args):
self.args = args
self.adj_matrix = weight_matrix(self.args.adj_mat_file)
L = np.eye(self.adj_matrix.shape[0]) + self.adj_matrix
D = np.sum(self.adj_matrix, axis=1)
# L = D - self.adj_matrix
# import ipdb; ipdb.set_trace()
edges = []
weights = []
for i in range(self.adj_matrix.shape[0]):
for j in range(self.adj_matrix.shape[1]):
edges.append([i, j])
weights.append(L[i][j])
self.edges = np.array(edges, dtype=np.int64)
self.weights = np.array(weights, dtype=np.float32).reshape(-1, 1)
self.norm = np.zeros_like(D, dtype=np.float32)
self.norm[D > 0] = np.power(D[D > 0], -0.5)
self.norm = self.norm.reshape(-1, 1)
def build_graph(self, x_batch):
"""build graph"""
B, T, n, _ = x_batch.shape
batch = B * T
batch_edges = []
for i in range(batch):
batch_edges.append(self.edges + (i * n))
batch_edges = np.vstack(batch_edges)
num_nodes = B * T * n
node_feat = {'norm': np.tile(self.norm, [batch, 1])}
edge_feat = {'weights': np.tile(self.weights, [batch, 1])}
graph = Graph(
num_nodes=num_nodes,
edges=batch_edges,
node_feat=node_feat,
edge_feat=edge_feat)
return graph
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file implements the training process of the STGCN model.
"""
import os
import sys
import time
import argparse
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.utils.logger import log
from data_loader.data_utils import data_gen_mydata, gen_batch
from data_loader.graph import GraphFactory
from models.model import STGCNModel
from models.tester import model_inference, model_test
def main(args):
"""main"""
PeMS = data_gen_mydata(args.input_file, args.label_file, args.n_route,
args.n_his, args.n_pred, (args.n_val, args.n_test))
log.info(PeMS.get_stats())
log.info(PeMS.get_len('train'))
gf = GraphFactory(args)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
"gw",
place,
node_feat=[('norm', [None, 1], "float32")],
edge_feat=[('weights', [None, 1], "float32")])
model = STGCNModel(args, gw)
train_loss, y_pred = model.forward()
infer_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
epoch_step = int(PeMS.get_len('train') / args.batch_size) + 1
lr = fl.exponential_decay(
learning_rate=args.lr,
decay_steps=5 * epoch_step,
decay_rate=0.7,
staircase=True)
if args.opt == 'RMSProp':
train_op = fluid.optimizer.RMSPropOptimizer(lr).minimize(
train_loss)
elif args.opt == 'ADAM':
train_op = fluid.optimizer.Adam(lr).minimize(train_loss)
exe = fluid.Executor(place)
exe.run(startup_program)
if args.inf_mode == 'sep':
# for inference mode 'sep', the type of step index is int.
step_idx = args.n_pred - 1
tmp_idx = [step_idx]
min_val = min_va_val = np.array([4e1, 1e5, 1e5])
elif args.inf_mode == 'merge':
# for inference mode 'merge', the type of step index is np.ndarray.
step_idx = tmp_idx = np.arange(3, args.n_pred + 1, 3) - 1
min_val = min_va_val = np.array([4e1, 1e5, 1e5]) * len(step_idx)
else:
raise ValueError(f'ERROR: test mode "{args.inf_mode}" is not defined.')
step = 0
for epoch in range(1, args.epochs + 1):
for idx, x_batch in enumerate(
gen_batch(
PeMS.get_data('train'),
args.batch_size,
dynamic_batch=True,
shuffle=True)):
x = np.array(x_batch[:, 0:args.n_his, :, :], dtype=np.float32)
graph = gf.build_graph(x)
feed = gw.to_feed(graph)
feed['input'] = np.array(
x_batch[:, 0:args.n_his + 1, :, :], dtype=np.float32)
b_loss, b_lr = exe.run(train_program,
feed=feed,
fetch_list=[train_loss, lr])
if idx % 5 == 0:
log.info("epoch %d | step %d | lr %.6f | loss %.6f" %
(epoch, idx, b_lr[0], b_loss[0]))
min_va_val, min_val = \
model_inference(exe, gw, gf, infer_program, y_pred, PeMS, args, \
step_idx, min_va_val, min_val)
for ix in tmp_idx:
va, te = min_va_val[ix - 2:ix + 1], min_val[ix - 2:ix + 1]
print(f'Time Step {ix + 1}: '
f'MAPE {va[0]:7.3%}, {te[0]:7.3%}; '
f'MAE {va[1]:4.3f}, {te[1]:4.3f}; '
f'RMSE {va[2]:6.3f}, {te[2]:6.3f}.')
if epoch % 5 == 0:
model_test(exe, gw, gf, infer_program, y_pred, PeMS, args)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('--n_route', type=int, default=74)
parser.add_argument('--n_his', type=int, default=23)
parser.add_argument('--n_pred', type=int, default=3)
parser.add_argument('--batch_size', type=int, default=10)
parser.add_argument('--epochs', type=int, default=100)
parser.add_argument('--save', type=int, default=10)
parser.add_argument('--Ks', type=int, default=3) #equal to num_layers
parser.add_argument('--Kt', type=int, default=3)
parser.add_argument('--lr', type=float, default=1e-2)
parser.add_argument('--keep_prob', type=float, default=1.0)
parser.add_argument('--opt', type=str, default='RMSProp')
parser.add_argument('--inf_mode', type=str, default='sep')
parser.add_argument('--input_file', type=str, default='dataset/input.csv')
parser.add_argument('--label_file', type=str, default='dataset/output.csv')
parser.add_argument(
'--city_file', type=str, default='dataset/crawl_list.csv')
parser.add_argument('--adj_mat_file', type=str, default='dataset/W_74.csv')
parser.add_argument('--output_path', type=str, default='./outputs/')
parser.add_argument('--n_val', type=str, default=1)
parser.add_argument('--n_test', type=str, default=1)
parser.add_argument('--use_cuda', action='store_true')
args = parser.parse_args()
blocks = [[1, 32, 64], [64, 32, 128]]
args.blocks = blocks
log.info(args)
if not os.path.exists(args.output_path):
os.makedirs(args.output_path)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This file implement the STGCN model.
"""
import numpy as np
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
class STGCNModel(object):
"""Implementation of Spatio-Temporal Graph Convolutional Networks"""
def __init__(self, args, gw):
self.args = args
self.gw = gw
self.input = fl.data(
name="input",
shape=[None, args.n_his + 1, args.n_route, 1],
dtype="float32")
def forward(self):
"""forward"""
x = self.input[:, 0:self.args.n_his, :, :]
# Ko>0: kernel size of temporal convolution in the output layer.
Ko = self.args.n_his
# ST-Block
for i, channels in enumerate(self.args.blocks):
x = self.st_conv_block(
x,
self.args.Ks,
self.args.Kt,
channels,
"st_conv_%d" % i,
self.args.keep_prob,
act_func='GLU')
# output layer
if Ko > 1:
y = self.output_layer(x, Ko, 'output_layer')
else:
raise ValueError(f'ERROR: kernel size Ko must be greater than 1, \
but received "{Ko}".')
label = self.input[:, self.args.n_his:self.args.n_his + 1, :, :]
train_loss = fl.reduce_sum((y - label) * (y - label))
single_pred = y[:, 0, :, :] # shape: [batch, n, 1]
return train_loss, single_pred
def st_conv_block(self,
x,
Ks,
Kt,
channels,
name,
keep_prob,
act_func='GLU'):
"""Spatio-Temporal convolution block"""
c_si, c_t, c_oo = channels
x_s = self.temporal_conv_layer(
x, Kt, c_si, c_t, "%s_tconv_in" % name, act_func=act_func)
x_t = self.spatio_conv_layer(x_s, Ks, c_t, c_t, "%s_sonv" % name)
x_o = self.temporal_conv_layer(x_t, Kt, c_t, c_oo,
"%s_tconv_out" % name)
x_ln = fl.layer_norm(x_o)
return fl.dropout(x_ln, dropout_prob=(1.0 - keep_prob))
def temporal_conv_layer(self, x, Kt, c_in, c_out, name, act_func='relu'):
"""Temporal convolution layer"""
_, T, n, _ = x.shape
if c_in > c_out:
x_input = fl.conv2d(
input=x,
num_filters=c_out,
filter_size=[1, 1],
stride=[1, 1],
padding="SAME",
data_format="NHWC",
param_attr=fluid.ParamAttr(name="%s_conv2d_1" % name))
elif c_in < c_out:
# if the number of input channels is less than the number of output
# channels, pad x up to the output channel size.
pad = fl.fill_constant_batch_size_like(
input=x,
shape=[-1, T, n, c_out - c_in],
dtype="float32",
value=0.0)
x_input = fl.concat([x, pad], axis=3)
else:
x_input = x
# x_input = x_input[:, Kt - 1:T, :, :]
if act_func == 'GLU':
# gated linear unit (GLU)
bt_init = fluid.initializer.ConstantInitializer(value=0.0)
bt = fl.create_parameter(
shape=[2 * c_out],
dtype="float32",
attr=fluid.ParamAttr(
name="%s_bt" % name, trainable=True, initializer=bt_init),
)
x_conv = fl.conv2d(
input=x,
num_filters=2 * c_out,
filter_size=[Kt, 1],
stride=[1, 1],
padding="SAME",
data_format="NHWC",
param_attr=fluid.ParamAttr(name="%s_conv2d_wt" % name))
x_conv = x_conv + bt
return (x_conv[:, :, :, 0:c_out] + x_input
) * fl.sigmoid(x_conv[:, :, :, -c_out:])
else:
bt_init = fluid.initializer.ConstantInitializer(value=0.0)
bt = fl.create_parameter(
shape=[c_out],
dtype="float32",
attr=fluid.ParamAttr(
name="%s_bt" % name, trainable=True, initializer=bt_init),
)
x_conv = fl.conv2d(
input=x,
num_filters=c_out,
filter_size=[Kt, 1],
stride=[1, 1],
padding="SAME",
data_format="NHWC",
param_attr=fluid.ParamAttr(name="%s_conv2d_wt" % name))
x_conv = x_conv + bt
if act_func == "linear":
return x_conv
elif act_func == "sigmoid":
return fl.sigmoid(x_conv)
elif act_func == "relu":
return fl.relu(x_conv + x_input)
else:
raise ValueError(
f'ERROR: activation function "{act_func}" is not defined.')
def spatio_conv_layer(self, x, Ks, c_in, c_out, name):
"""Spatio convolution layer"""
_, T, n, _ = x.shape
if c_in > c_out:
x_input = fl.conv2d(
input=x,
num_filters=c_out,
filter_size=[1, 1],
stride=[1, 1],
padding="SAME",
data_format="NHWC",
param_attr=fluid.ParamAttr(name="%s_conv2d_1" % name))
elif c_in < c_out:
# if the number of input channels is less than the number of output
# channels, pad x up to the output channel size.
pad = fl.fill_constant_batch_size_like(
input=x,
shape=[-1, T, n, c_out - c_in],
dtype="float32",
value=0.0)
x_input = fl.concat([x, pad], axis=3)
else:
x_input = x
for i in range(Ks):
# x_input shape: [B,T, num_nodes, c_out]
x_input = fl.reshape(x_input, [-1, c_out])
x_input = self.message_passing(
self.gw,
x_input,
name="%s_mp_%d" % (name, i),
norm=self.gw.node_feat["norm"])
x_input = fl.fc(x_input,
size=c_out,
bias_attr=False,
param_attr=fluid.ParamAttr(name="%s_gcn_fc_%d" %
(name, i)))
bias = fluid.layers.create_parameter(
shape=[c_out],
dtype='float32',
is_bias=True,
name='%s_gcn_bias_%d' % (name, i))
x_input = fluid.layers.elementwise_add(x_input, bias, act="relu")
x_input = fl.reshape(x_input, [-1, T, n, c_out])
return x_input
def message_passing(self, gw, feature, name, norm=None):
"""Message passing layer"""
def send_src_copy(src_feat, dst_feat, edge_feat):
"""send function"""
return src_feat["h"] * edge_feat['w']
if norm is not None:
feature = feature * norm
msg = gw.send(
send_src_copy,
nfeat_list=[("h", feature)],
efeat_list=[('w', gw.edge_feat['weights'])])
output = gw.recv(msg, "sum")
if norm is not None:
output = output * norm
return output
def output_layer(self, x, T, name, act_func='GLU'):
"""Output layer"""
_, _, n, channel = x.shape
# maps multi-steps to one.
x_i = self.temporal_conv_layer(
x=x,
Kt=T,
c_in=channel,
c_out=channel,
name="%s_in" % name,
act_func=act_func)
x_ln = fl.layer_norm(x_i)
x_o = self.temporal_conv_layer(
x=x_ln,
Kt=1,
c_in=channel,
c_out=channel,
name="%s_out" % name,
act_func='sigmoid')
# maps multi-channels to one.
x_fc = self.fully_con_layer(
x=x_o, n=n, channel=channel, name="%s_fc" % name)
return x_fc
def fully_con_layer(self, x, n, channel, name):
"""Fully connected layer"""
bt_init = fluid.initializer.ConstantInitializer(value=0.0)
bt = fl.create_parameter(
shape=[n, 1],
dtype="float32",
attr=fluid.ParamAttr(
name="%s_bt" % name, trainable=True, initializer=bt_init), )
x_conv = fl.conv2d(
input=x,
num_filters=1,
filter_size=[1, 1],
stride=[1, 1],
padding="SAME",
data_format="NHWC",
param_attr=fluid.ParamAttr(name="%s_conv2d" % name))
x_conv = x_conv + bt
return x_conv
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This file implement the testing process of STGCN model.
"""
import os
import sys
import time
import argparse
import numpy as np
import pandas as pd
import paddle.fluid as fluid
import paddle.fluid.layers as fl
import pgl
from pgl.utils.logger import log
from data_loader.data_utils import gen_batch
from utils.math_utils import evaluation
def multi_pred(exe, gw, gf, program, y_pred, seq, batch_size, \
n_his, n_pred, step_idx, dynamic_batch=True):
"""multi step prediction"""
pred_list = []
for i in gen_batch(
seq, min(batch_size, len(seq)), dynamic_batch=dynamic_batch):
# Note: use np.copy() to avoid the modification of source data.
test_seq = np.copy(i[:, 0:n_his + 1, :, :]).astype(np.float32)
graph = gf.build_graph(i[:, 0:n_his, :, :])
feed = gw.to_feed(graph)
step_list = []
for j in range(n_pred):
feed['input'] = test_seq
pred = exe.run(program, feed=feed, fetch_list=[y_pred])
if isinstance(pred, list):
pred = np.array(pred[0])
test_seq[:, 0:n_his - 1, :, :] = test_seq[:, 1:n_his, :, :]
test_seq[:, n_his - 1, :, :] = pred
step_list.append(pred)
pred_list.append(step_list)
# pred_array -> [n_pred, len(seq), n_route, C_0)
pred_array = np.concatenate(pred_list, axis=1)
return pred_array, pred_array.shape[1]
def model_inference(exe, gw, gf, program, pred, inputs, args, step_idx,
min_va_val, min_val):
"""inference model"""
x_val, x_test, x_stats = inputs.get_data('val'), inputs.get_data(
'test'), inputs.get_stats()
if args.n_his + args.n_pred > x_val.shape[1]:
raise ValueError(
f'ERROR: the value of n_pred "{args.n_pred}" exceeds the length limit.'
)
# y_val shape: [n_pred, len(x_val), n_route, C_0)
y_val, len_val = multi_pred(exe, gw, gf, program, pred, \
x_val, args.batch_size, args.n_his, args.n_pred, step_idx)
evl_val = evaluation(x_val[0:len_val, step_idx + args.n_his, :, :],
y_val[step_idx], x_stats)
# chks: indicator that reflects the relationship of values between evl_val and min_va_val.
chks = evl_val < min_va_val
# update the metrics on the test set if the model's performance improved on the validation set.
if sum(chks):
min_va_val[chks] = evl_val[chks]
y_pred, len_pred = multi_pred(exe, gw, gf, program, pred, \
x_test, args.batch_size, args.n_his, args.n_pred, step_idx)
evl_pred = evaluation(x_test[0:len_pred, step_idx + args.n_his, :, :],
y_pred[step_idx], x_stats)
min_val = evl_pred
return min_va_val, min_val
def model_test(exe, gw, gf, program, pred, inputs, args):
"""test model"""
if args.inf_mode == 'sep':
# for inference mode 'sep', the type of step index is int.
step_idx = args.n_pred - 1
tmp_idx = [step_idx]
elif args.inf_mode == 'merge':
# for inference mode 'merge', the type of step index is np.ndarray.
step_idx = tmp_idx = np.arange(3, args.n_pred + 1, 3) - 1
print(step_idx)
else:
raise ValueError(f'ERROR: test mode "{args.inf_mode}" is not defined.')
x_test, x_stats = inputs.get_data('test'), inputs.get_stats()
y_test, len_test = multi_pred(exe, gw, gf, program, pred, \
x_test, args.batch_size, args.n_his, args.n_pred, step_idx)
# save result
gt = x_test[0:len_test, args.n_his:, :, :].reshape(-1, args.n_route)
y_pred = y_test.reshape(-1, args.n_route)
city_df = pd.read_csv(args.city_file)
city_df = city_df.drop(0)
np.savetxt(
os.path.join(args.output_path, "groundtruth.csv"),
gt.astype(np.int32),
fmt='%d',
delimiter=',',
header=",".join(city_df['city']))
np.savetxt(
os.path.join(args.output_path, "prediction.csv"),
y_pred.astype(np.int32),
fmt='%d',
delimiter=",",
header=",".join(city_df['city']))
for i in range(step_idx + 1):
evl = evaluation(x_test[0:len_test, step_idx + args.n_his, :, :],
y_test[i], x_stats)
for ix in tmp_idx:
te = evl[ix - 2:ix + 1]
print(
f'Time Step {i + 1}: MAPE {te[0]:7.3%}; MAE {te[1]:4.3f}; RMSE {te[2]:6.3f}.'
)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Evaluation"""
import os
import sys
import time
import argparse
import numpy as np
def z_score(x, mean, std):
"""z_score"""
return (x - mean) / std
def z_inverse(x, mean, std):
"""The inverse of function z_score"""
return x * std + mean
def MAPE(v, v_):
"""Mean absolute percentage error."""
return np.mean(np.abs(v_ - v) / (v + 1e-5))
def RMSE(v, v_):
"""Mean squared error."""
return np.sqrt(np.mean((v_ - v)**2))
def MAE(v, v_):
"""Mean absolute error."""
return np.mean(np.abs(v_ - v))
def evaluation(y, y_, x_stats):
"""Calculate MAPE, MAE and RMSE between ground truth and prediction."""
dim = len(y_.shape)
if dim == 3:
# single_step case
v = z_inverse(y, x_stats['mean'], x_stats['std'])
v_ = z_inverse(y_, x_stats['mean'], x_stats['std'])
return np.array([MAPE(v, v_), MAE(v, v_), RMSE(v, v_)])
else:
# multi_step case
tmp_list = []
# y -> [time_step, batch_size, n_route, 1]
y = np.swapaxes(y, 0, 1)
# recursively call
for i in range(y_.shape[0]):
tmp_res = evaluation(y[i], y_[i], x_stats)
tmp_list.append(tmp_res)
return np.concatenate(tmp_list, axis=-1)
# struc2vec: Learning Node Representations from Structural Identity
[Struc2vec](https://arxiv.org/abs/1704.03165) is built on the concept of structural identity, in which network nodes are identified according to the network structure and their relationship to other nodes. The struc2vec paper proposes a novel and flexible framework for learning such latent representations. We reproduce the Struc2vec algorithm in PGL.
### Dataset
The paper uses air-traffic networks to validate the Struc2vec algorithm.
Each edge in the dataset indicates that there is at least one flight between the two airports, and the connections between airports are used to predict their level of activity. The following files are used to validate the accuracy of the algorithm (a minimal sketch of reading the edge list follows the file list). The data was collected from the Bureau of Transportation Statistics from January to October 2016; the network has 1,190 nodes and 13,599 edges (the diameter is 8). [Link](https://www.transtats.bts.gov/)
- usa-airports.edgelist
- labels-usa-airports.txt
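For reference, each line of `usa-airports.edgelist` is expected to hold two whitespace-separated integer airport ids, which is how `data_loader.py` below parses it; a minimal reading sketch:
```python
edges = []
with open("data/usa-airports.edgelist") as f:
    for line in f:
        src, dst = (int(field) for field in line.split())
        edges.append((src, dst))
print(len(edges), "edges loaded")
```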
### Dependencies
If you want to use the struc2vec model in PGL, please additionally install gensim, pathos and fastdtw.
- paddlepaddle>=1.6
- pgl
- gensim
- pathos
- fastdtw
### How to use
For example, to train and validate the Struc2vec model on the American airport dataset:
> python struc2vec.py --edge_file data/usa-airports.edgelist --label_file data/labels-usa-airports.txt --train True --valid True --opt2 True
### Hyperparameters
| Args| Meaning|
| ------------- | ------------- |
| edge_file | input file name for edges|
| label_file | input file name for node labels|
| emb_file | output file name for the learned node embedding (see the loading sketch after this table)|
| walk_depth| The number of steps in each random walk|
| opt1| The flag to open optimization 1 to reduce time cost|
| opt2| The flag to open optimization 2 to reduce time cost|
| w2v_emb_size| The dimension of the output word2vec embedding|
| w2v_window_size| The context window size of word2vec|
| w2v_epoch| The number of epochs to train the word2vec model|
| train| The flag to run the struc2vec algorithm to get the w2v embedding|
| valid| The flag to use the w2v embedding to validate the classification result|
| num_class| The number of classes in the classification model to be trained|
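For reference, a minimal sketch of loading the learned embedding for downstream use; it assumes gensim is installed (see Dependencies) and that the default output name `w2v_emb` was used:
```python
import numpy as np
from gensim.models import KeyedVectors

# The training phase saves the node embedding in word2vec text format.
emb = KeyedVectors.load_word2vec_format("w2v_emb")

# Node ids are stored as string keys (after reindexing by the data loader).
node_id = "1"
if node_id in emb:
    vec = np.asarray(emb[node_id])
    print(vec.shape)  # (w2v_emb_size,), 128 by default
```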
### Experiment results
| Dataset | Model | Metric | PGL Result | Paper repo Result |
| ------------- | ------------- |------------- |------------- |------------- |
| American airport dataset | Struc2vec without time cost optimization| ACC |0.6483|0.6340|
| American airport dataset | Struc2vec with optimization 1| ACC |0.6466|0.6242|
| American airport dataset | Struc2vec with optimization 2| ACC |0.6252|0.6241|
| American airport dataset | Struc2vec with optimizations 1 & 2| ACC |0.6226|0.6083|
"""
classify.py
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
import paddle.fluid as fluid
def build_lr_model(args):
"""
Build the LR model to train.
"""
emb_x = fluid.layers.data(
name="emb_x", dtype='float32', shape=[args.w2v_emb_size])
label = fluid.layers.data(name="label_y", dtype='int64', shape=[1])
logits = fluid.layers.fc(input=emb_x,
size=args.num_class,
act=None,
name='classification_layer')
proba = fluid.layers.softmax(logits)
loss = fluid.layers.softmax_with_cross_entropy(logits, label)
loss = fluid.layers.mean(loss)
acc = fluid.layers.accuracy(input=proba, label=label, k=1)
return loss, acc
def construct_feed_data(data):
"""
Construct the data to feed model.
"""
datas = []
labels = []
for sample in data:
if len(datas) < 16:
labels.append([sample[-1]])
datas.append(sample[1:-1])
else:
yield np.array(datas).astype(np.float32), np.array(labels).astype(
np.int64)
datas = []
labels = []
if len(datas) != 0:
yield np.array(datas).astype(np.float32), np.array(labels).astype(
np.int64)
def run_epoch(exe, data, program, stage, epoch, loss, acc):
"""
The function to run one epoch.
"""
print('start {} epoch of {}'.format(stage, epoch))
all_loss = 0.0
all_acc = 0.0
all_samples = 0.0
count = 0
for datas, labels in construct_feed_data(data):
batch_loss, batch_acc = exe.run(
program,
fetch_list=[loss, acc],
feed={"emb_x": datas,
"label_y": labels})
len_samples = len(datas)
all_loss += batch_loss * len_samples
all_acc += batch_acc * len_samples
all_samples += len_samples
count += 1
print("pass:{}, epoch:{}, loss:{}, acc:{}".format(
stage, epoch, all_loss / all_samples, all_acc / all_samples))
def train_lr_model(args, data):
"""
The main function to run the lr model.
"""
data_nums = len(data)
train_data_nums = int(0.8 * data_nums)
train_data = data[:train_data_nums]
test_data = data[train_data_nums:]
place = fluid.CPUPlace()
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
loss, acc = build_lr_model(args)
test_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
for epoch in range(0, args.epoch):
run_epoch(exe, train_data, train_program, "train", epoch, loss, acc)
print('-------------------')
run_epoch(exe, test_data, test_program, "valid", epoch, loss, acc)
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
data_loader.py
"""
from pgl import graph
import numpy as np
class EdgeDataset():
"""
The data loader reads the edge file and reindexes the source and destination nodes at the same time.
"""
def __init__(self, undirected=True, data_dir=""):
self._undirected = undirected
self._data_dir = data_dir
self._load_edge_data()
def _load_edge_data(self):
node_sets = set()
edges = []
with open(self._data_dir, "r") as f:
node_dict = dict()
for line in f:
src, dist = [
int(data) for data in line.strip("\n\r").split(" ")
]
if src not in node_dict:
node_dict[src] = len(node_dict) + 1
src = node_dict[src]
if dist not in node_dict:
node_dict[dist] = len(node_dict) + 1
dist = node_dict[dist]
node_sets.add(src)
node_sets.add(dist)
edges.append((src, dist))
if self._undirected:
edges.append((dist, src))
num_nodes = len(node_sets)
self.graph = graph.Graph(num_nodes=num_nodes + 1, edges=edges)
self.nodes = np.array(list(node_sets))
self.node_dict = node_dict
"""
sklearn_classify.py
"""
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
random_seed = 67
def train_lr_l2_model(args, data):
"""
The main function to train lr model with l2 regularization.
"""
acc_list = []
data = np.array(data)
data = data[data[:, 0].argsort()]
x_data = data[:, 1:-1]
y_data = data[:, -1]
for random_num in range(0, 10):
X_train, X_test, y_train, y_test = train_test_split(
x_data,
y_data,
test_size=0.2,
random_state=random_num + random_seed)
# use the one vs rest to train the lr model with l2
pred_test = []
for i in range(0, args.num_class):
y_train_relabel = np.where(y_train == i, 1, 0)
y_test_relabel = np.where(y_test == i, 1, 0)
lr = LogisticRegression(C=10.0, random_state=0, max_iter=100)
lr.fit(X_train, y_train_relabel)
pred = lr.predict_proba(X_test)
pred_test.append(pred[:, -1].tolist())
pred_test = np.array(pred_test)
pred_test = np.transpose(pred_test)
c_index = np.argmax(pred_test, axis=1)
acc = accuracy_score(y_test.flatten(), c_index)
acc_list.append(acc)
print("pass:{}-acc:{}".format(random_num, acc))
print("the avg acc is {}".format(np.mean(acc_list)))
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
struc2vec.py
"""
import argparse
import math
import random
import numpy as np
import pgl
from pgl import graph
from pgl.graph_kernel import alias_sample_build_table
from pgl.sample import alias_sample
from data_loader import EdgeDataset
from classify import train_lr_model
from sklearn_classify import train_lr_l2_model
def selectDegrees(degree_root, index_left, index_right, degree_left,
degree_right):
"""
Select which degree to visit in the next step.
"""
if index_left == -1:
degree_now = degree_right
elif index_right == -1:
degree_now = degree_left
elif (abs(degree_left - degree_root) < abs(degree_right - degree_root)):
degree_now = degree_left
else:
degree_now = degree_right
return degree_now
class StrucVecGraph():
"""
This class wraps the PGL graph and provides the functions that implement the struc2vec algorithm.
"""
def __init__(self, graph, nodes, opt1, opt2, opt3, depth, num_walks,
walk_depth):
self.graph = graph
self.nodes = nodes
self.opt1 = opt1
self.opt2 = opt2
self.opt3 = opt3
self.num_walks = num_walks
self.walk_depth = walk_depth
self.tag = args.tag
self.degree_list = dict()
self.degree2nodes = dict()
self.node2degree = dict()
self.distance = dict()
self.degrees_sorted = None
self.layer_distance = dict()
self.layer_message = dict()
self.layer_norm_distance = dict()
self.sample_alias = dict()
self.sample_events = dict()
self.layer_node_weight_count = dict()
if opt3 == True:
self.depth = depth
else:
self.depth = 1000
def distance_func(self, a, b):
"""
The basic function to calculate the distance between two lists with different lengths.
"""
ep = 0.5
m = max(a, b) + ep
mi = min(a, b) + ep
return ((m / mi) - 1)
def distance_opt1_func(self, a, b):
"""
The optimized function to calculate the distance between two lists of (degree, count) pairs.
"""
ep = 0.5
m = max(a[0], b[0]) + ep
mi = min(a[0], b[0]) + ep
return ((m / mi) - 1) * max(a[1], b[1])
def add_degree_todict(self, node_id, degree, depth, opt1):
"""
Save the degree sequence of each node at each depth into a dict.
"""
if node_id not in self.degree_list:
self.degree_list[node_id] = dict()
if depth not in self.degree_list[node_id]:
self.degree_list[node_id][depth] = None
if opt1:
degree = np.array(np.unique(degree, return_counts=True)).T
self.degree_list[node_id][depth] = degree
def output_degree_with_depth(self, depth, opt1):
"""
Use BFS to collect the degree sequence of each layer.
"""
degree_dict = dict()
for node in self.nodes:
start_node = node
cur_node = node
cur_dep = 0
flag_visit = set()
while cur_node is not None and cur_dep < depth:
if not isinstance(cur_node, list):
cur_node = [cur_node]
filter_node = []
for node in cur_node:
if node not in flag_visit:
flag_visit.add(node)
filter_node.append(node)
cur_node = filter_node
if len(cur_node) == 0:
break
outdegree = self.graph.outdegree(cur_node)
mask = (outdegree != 0)
if np.any(mask):
outdegree = np.sort(outdegree[mask])
else:
break
# save the layer degree message to dict
self.add_degree_todict(start_node, outdegree[mask], cur_dep,
opt1)
succes = self.graph.successor(cur_node)
cur_node = []
for succ in succes:
if isinstance(succ, np.ndarray):
cur_node.extend(succ.flatten().tolist())
elif isinstance(succ, int):
cur_node.append(succ)
cur_node = list(set(cur_node))
cur_dep += 1
def get_sim_neighbours(self, node, selected_num):
"""
Select the neighbours by degree similarity.
"""
degree = self.node2degree[node]
select_count = 0
node_nbh_list = list()
for node_nbh in self.degree2nodes[degree]:
if node != node_nbh:
node_nbh_list.append(node_nbh)
select_count += 1
if select_count > selected_num:
return node_nbh_list
degree_vec_len = len(self.degrees_sorted)
index_degree = self.degrees_sorted.index(degree)
index_left = -1
index_right = -1
degree_left = -1
degree_right = -1
if index_degree != -1 and index_degree >= 1:
index_left = index_degree - 1
if index_degree != -1 and index_degree <= degree_vec_len - 2:
index_right = index_degree + 1
if index_left == -1 and index_right == -1:
return node_nbh_list
if index_left != -1:
degree_left = self.degrees_sorted[index_left]
if index_right != -1:
degree_right = self.degrees_sorted[index_right]
select_degree = selectDegrees(degree, index_left, index_right,
degree_left, degree_right)
while True:
for node_nbh in self.degree2nodes[select_degree]:
if node_nbh != node:
node_nbh_list.append(node_nbh)
select_count += 1
if select_count > selected_num:
return node_nbh_list
if select_degree == degree_left:
if index_left >= 1:
index_left = index_left - 1
else:
index_left = -1
else:
if index_right <= degree_vec_len - 2:
index_right += 1
else:
index_right = -1
if index_left == -1 and index_right == -1:
return node_nbh_list
if index_left != -1:
degree_left = self.degrees_sorted[index_left]
if index_right != -1:
degree_right = self.degrees_sorted[index_right]
select_degree = selectDegrees(degree, index_left, index_right,
degree_left, degree_right)
return node_nbh_list
def calc_node_with_neighbor_dtw_opt2(self, src):
"""
Use the optimized algorithm (opt2) to reduce the range of candidate neighbours.
"""
from fastdtw import fastdtw
node_nbh_list = self.get_sim_neighbours(src, self.selected_nbh_nums)
distance = {}
for dist in node_nbh_list:
calc_layer_len = min(len(self.degree_list[src]), \
len(self.degree_list[dist]))
distance_iteration = 0.0
distance[src, dist] = {}
for layer in range(0, calc_layer_len):
src_layer = self.degree_list[src][layer]
dist_layer = self.degree_list[dist][layer]
weight, path = fastdtw(
src_layer,
dist_layer,
radius=1,
dist=self.distance_calc_func)
distance_iteration += weight
distance[src, dist][layer] = distance_iteration
return distance
def calc_node_with_neighbor_dtw(self, src_index):
"""
Without the optimization, calculate the distance over all node pairs.
"""
from fastdtw import fastdtw
distance = {}
for dist_index in range(src_index + 1, self.graph.num_nodes - 1):
src = self.nodes[src_index]
dist = self.nodes[dist_index]
calc_layer_len = min(len(self.degree_list[src]), \
len(self.degree_list[dist]))
distance_iteration = 0.0
distance[src, dist] = {}
for layer in range(0, calc_layer_len):
src_layer = self.degree_list[src][layer]
dist_layer = self.degree_list[dist][layer]
weight, path = fastdtw(
src_layer,
dist_layer,
radius=1,
dist=self.distance_calc_func)
distance_iteration += weight
distance[src, dist][layer] = distance_iteration
return distance
def calc_distances_between_nodes(self):
"""
Use the dtw algorithm to calculate the distance between nodes.
"""
from fastdtw import fastdtw
from pathos.multiprocessing import Pool
# decide which distance function to use
if self.opt1 == True:
self.distance_calc_func = self.distance_opt1_func
else:
self.distance_calc_func = self.distance_func
dtws = []
if self.opt2:
depth = 0
for node in self.nodes:
if node in self.degree_list:
if depth in self.degree_list[node]:
degree = self.degree_list[node][depth]
if args.opt1:
degree = degree[0][0]
else:
degree = degree[0]
if degree not in self.degree2nodes:
self.degree2nodes[degree] = []
if node not in self.node2degree:
self.node2degree[node] = degree
self.degree2nodes[degree].append(node)
# select about 2 * log2(n) similar-degree candidates for each node
degree_keys = self.degree2nodes.keys()
degree_keys = np.array(list(degree_keys), dtype='int')
self.degrees_sorted = list(np.sort(degree_keys))
selected_nbh_nums = 2 * math.log(self.graph.num_nodes - 1, 2)
self.selected_nbh_nums = selected_nbh_nums
pool = Pool(10)
dtws = pool.map(self.calc_node_with_neighbor_dtw_opt2, self.nodes)
pool.close()
pool.join()
else:
src_indices = range(0, self.graph.num_nodes - 2)
pool = Pool(10)
dtws = pool.map(self.calc_node_with_neighbor_dtw, src_indices)
pool.close()
pool.join()
print('calc the dtw done.')
for dtw in dtws:
self.distance.update(dtw)
def normlization_layer_weight(self):
"""
Normalize the distances between nodes: weight[i] = distance[i] / sum(distance)
"""
for sd_keys, layer_weight in self.distance.items():
src, dist = sd_keys
layers, weights = layer_weight.keys(), layer_weight.values()
for layer, weight in zip(layers, weights):
if layer not in self.layer_distance:
self.layer_distance[layer] = {}
if layer not in self.layer_message:
self.layer_message[layer] = {}
self.layer_distance[layer][src, dist] = weight
if src not in self.layer_message[layer]:
self.layer_message[layer][src] = []
if dist not in self.layer_message[layer]:
self.layer_message[layer][dist] = []
self.layer_message[layer][src].append(dist)
self.layer_message[layer][dist].append(src)
# normalization the layer weight
for i in range(0, self.depth):
layer_weight = 0.0
layer_count = 0
if i not in self.layer_norm_distance:
self.layer_norm_distance[i] = {}
if i not in self.sample_alias:
self.sample_alias[i] = {}
if i not in self.sample_events:
self.sample_events[i] = {}
if i not in self.layer_message:
continue
for node in self.nodes:
if node not in self.layer_message[i]:
continue
nbhs = self.layer_message[i][node]
weights = []
sum_weight = 0.0
for dist in nbhs:
if (node, dist) in self.layer_distance[i]:
weight = self.layer_distance[i][node, dist]
else:
weight = self.layer_distance[i][dist, node]
weight = np.exp(-float(weight))
weights.append(weight)
# norm the weight
sum_weight = sum(weights)
if sum_weight == 0.0:
sum_weight = 1.0
weight_list = [weight / sum_weight for weight in weights]
self.layer_norm_distance[i][node] = weight_list
alias, events = alias_sample_build_table(np.array(weight_list))
self.sample_alias[i][node] = alias
self.sample_events[i][node] = events
layer_weight += 1.0
#layer_weight += sum(weight_list)
layer_count += len(weights)
layer_avg_weight = layer_weight / (1.0 * layer_count)
self.layer_node_weight_count[i] = dict()
for node in self.nodes:
if node not in self.layer_norm_distance[i]:
continue
weight_list = self.layer_norm_distance[i][node]
node_cnt = 0
for weight in weight_list:
if weight > layer_avg_weight:
node_cnt += 1
self.layer_node_weight_count[i][node] = node_cnt
def choose_neighbor_alias_method(self, node, layer):
"""
Choose a neighbour randomly via alias sampling.
"""
weight_list = self.layer_norm_distance[layer][node]
neighbors = self.layer_message[layer][node]
select_idx = alias_sample(1, self.sample_alias[layer][node],
self.sample_events[layer][node])
return neighbors[select_idx[0]]
def choose_layer_to_walk(self, node, layer):
"""
Choose the layer for the next random-walk step.
"""
random_value = random.random()
higher_neigbours_nums = self.layer_node_weight_count[layer][node]
prob = math.log(higher_neigbours_nums + math.e)
prob = prob / (1.0 + prob)
if random_value > prob:
if layer > 0:
layer = layer - 1
else:
if layer + 1 in self.layer_message and \
node in self.layer_message[layer + 1]:
layer = layer + 1
return layer
def executor_random_walk(self, walk_process_id):
"""
The main function to execute the structural random walk.
"""
nodes = self.nodes
random.shuffle(nodes)
walk_path_all_nodes = []
for node in nodes:
walk_path = []
walk_path.append(node)
layer = 0
while len(walk_path) < self.walk_depth:
prop = random.random()
if prop < 0.3:
node = self.choose_neighbor_alias_method(node, layer)
walk_path.append(node)
else:
layer = self.choose_layer_to_walk(node, layer)
walk_path_all_nodes.append(walk_path)
return walk_path_all_nodes
def random_walk_structual_sim(self):
"""
Walk the paths according to the structural distances.
"""
from pathos.multiprocessing import Pool
print('start process struc2vec random walk.')
walks_process_ids = [i for i in range(0, self.num_walks)]
pool = Pool(10)
walks = pool.map(self.executor_random_walk, walks_process_ids)
pool.close()
pool.join()
#save the final walk result
file_result = open(args.tag + "_walk_path", "w")
for walk in walks:
for walk_node in walk:
walk_node_str = " ".join([str(node) for node in walk_node])
file_result.write(walk_node_str + "\n")
file_result.close()
print('process struc2vec random walk done.')
def learning_embedding_from_struc2vec(args):
"""
Learn word2vec embeddings from the random walk paths.
"""
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
struc_walks = LineSentence(args.tag + "_walk_path")
model = Word2Vec(struc_walks, size=args.w2v_emb_size, window=args.w2v_window_size, iter=args.w2v_epoch, \
min_count=0, hs=1, sg=1, workers=5)
model.wv.save_word2vec_format(args.emb_file)
def main(args):
"""
The main function to run the struc2vec algorithm.
"""
if args.train:
dataset = EdgeDataset(
undirected=args.undirected, data_dir=args.edge_file)
graph = StrucVecGraph(dataset.graph, dataset.nodes, args.opt1, args.opt2, args.opt3, args.depth,\
args.num_walks, args.walk_depth)
graph.output_degree_with_depth(args.depth, args.opt1)
graph.calc_distances_between_nodes()
graph.normlization_layer_weight()
graph.random_walk_structual_sim()
learning_embedding_from_struc2vec(args)
file_label = open(args.label_file)
file_label_reindex = open(args.label_file + "_reindex", "w")
for line in file_label:
items = line.strip("\n\r").split(" ")
try:
items = [int(item) for item in items]
except:
continue
if items[0] not in dataset.node_dict:
continue
reindex = dataset.node_dict[items[0]]
file_label_reindex.write(str(reindex) + " " + str(items[1]) + "\n")
file_label_reindex.close()
if args.valid:
emb_file = open(args.emb_file)
file_label_reindex = open(args.label_file + "_reindex")
label_dict = dict()
for line in file_label_reindex:
items = line.strip("\n\r").split(" ")
try:
label_dict[int(items[0])] = int(items[1])
except:
continue
data_for_train_valid = []
for line in emb_file:
items = line.strip("\n\r").split(" ")
if len(items) <= 2:
continue
index = int(items[0])
label = int(label_dict[index])
sample = []
sample.append(index)
feature_emb = items[1:]
feature_emb = [float(feature) for feature in feature_emb]
sample.extend(feature_emb)
sample.append(label)
data_for_train_valid.append(sample)
train_lr_l2_model(args, data_for_train_valid)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='struc2vec')
parser.add_argument("--edge_file", type=str, default="")
parser.add_argument("--label_file", type=str, default="")
parser.add_argument("--emb_file", type=str, default="w2v_emb")
parser.add_argument("--undirected", type=bool, default=True)
parser.add_argument("--depth", type=int, default=8)
parser.add_argument("--num_walks", type=int, default=10)
parser.add_argument("--walk_depth", type=int, default=80)
parser.add_argument("--opt1", type=bool, default=False)
parser.add_argument("--opt2", type=bool, default=False)
parser.add_argument("--opt3", type=bool, default=False)
parser.add_argument("--w2v_emb_size", type=int, default=128)
parser.add_argument("--w2v_window_size", type=int, default=10)
parser.add_argument("--w2v_epoch", type=int, default=5)
parser.add_argument("--train", type=bool, default=False)
parser.add_argument("--valid", type=bool, default=False)
parser.add_argument("--lr", type=float, default=0.0001)
parser.add_argument("--num_class", type=int, default=4)
parser.add_argument("--epoch", type=int, default=2000)
parser.add_argument("--tag", type=str, default="")
args = parser.parse_args()
main(args)
# Unsupervised GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of metrics as the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
For the purpose of unsupervised learning, we use graph edges as positive samples for GraphSAGE training.
### Datasets (Quickstart)
The dataset `./sample.txt` is a handcrafted bipartite graph for quick demo purposes; its format is `src \t dst` (a minimal loading sketch follows).
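For reference, a minimal sketch of reading `sample.txt` into positive `(src, dst)` pairs, assuming whitespace-separated integer node ids:
```python
import numpy as np

src, dst = [], []
with open("./sample.txt") as f:
    for line in f:
        s, d = line.split()
        src.append(int(s))
        dst.append(int(d))
src = np.array(src, dtype="int64")
dst = np.array(dst, dtype="int64")
# Each pair (src[i], dst[i]) is one edge, used as a positive sample in training.
print(src.shape, dst.shape)
```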
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
```
### How to run
#### 1. Training
```sh
python train.py --data_path ./sample.txt --num_nodes 2000 --phase train
```
#### 2. Predicting
```sh
python train.py --data_path ./sample.txt --num_nodes 2000 --phase predict
```
The resulting node embeddings are stored in the `emb.npy` file, which can later be loaded using `np.load`, as sketched below.
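A minimal sketch of loading the saved embeddings; it assumes `emb.npy` holds a 2-D array of shape `[num_nodes, embedding_dim]` (the exact layout depends on `train.py`, which is not shown here):
```python
import numpy as np

emb = np.load("emb.npy")
print(emb.shape)

# Example use: cosine similarity between the embeddings of node 0 and node 1.
a, b = emb[0], emb[1]
sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
print(sim)
```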
#### Hyperparameters
- epoch: Number of epochs. (default: 1)
- use_cuda: Use GPU if specified.
- layer_type: We support 4 aggregator types including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- sample_workers: The number of workers for multiprocessing subgraph sample.
- lr: Learning rate.
- batch_size: Batch size.
- samples: The max neighbors sampling rate for each hop. (default: [10, 10])
- num_layers: The number of layer for graph sampling. (default: 2)
- hidden_size: The hidden size of the GraphSAGE models.
- checkpoint: Path for the model checkpoint at each epoch. (default: 'model_ckpt')
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""model.py"""
import paddle
import paddle.fluid as fluid
def copy_send(src_feat, dst_feat, edge_feat):
"""copy_send"""
return src_feat["h"]
def mean_recv(feat):
"""mean_recv"""
return fluid.layers.sequence_pool(feat, pool_type="average")
def sum_recv(feat):
"""sum_recv"""
return fluid.layers.sequence_pool(feat, pool_type="sum")
def max_recv(feat):
"""max_recv"""
return fluid.layers.sequence_pool(feat, pool_type="max")
def lstm_recv(feat):
"""lstm_recv"""
hidden_dim = 128
forward, _ = fluid.layers.dynamic_lstm(
input=feat, size=hidden_dim * 4, use_peepholes=False)
output = fluid.layers.sequence_last_step(forward)
return output
def graphsage_mean(gw, feature, hidden_size, act, name):
"""graphsage_mean"""
msg = gw.send(copy_send, nfeat_list=[("h", feature)])
neigh_feature = gw.recv(msg, mean_recv)
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_meanpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
"""graphsage_meanpool"""
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, mean_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_maxpool(gw,
feature,
hidden_size,
act,
name,
inner_hidden_size=512):
"""graphsage_maxpool"""
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
msg = gw.send(copy_send, nfeat_list=[("h", neigh_feature)])
neigh_feature = gw.recv(msg, max_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
def graphsage_lstm(gw, feature, hidden_size, act, name):
"""graphsage_lstm"""
inner_hidden_size = 128
neigh_feature = fluid.layers.fc(feature, inner_hidden_size, act="relu")
hidden_dim = 128
forward_proj = fluid.layers.fc(input=neigh_feature,
size=hidden_dim * 4,
bias_attr=False,
name="lstm_proj")
msg = gw.send(copy_send, nfeat_list=[("h", forward_proj)])
neigh_feature = gw.recv(msg, lstm_recv)
neigh_feature = fluid.layers.fc(neigh_feature,
hidden_size,
act=act,
name=name + '_r')
self_feature = feature
self_feature = fluid.layers.fc(self_feature,
hidden_size,
act=act,
name=name + '_l')
output = fluid.layers.concat([self_feature, neigh_feature], axis=1)
output = fluid.layers.l2_normalize(output, axis=1)
return output
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""reader.py"""
import os
import numpy as np
import pickle as pkl
import paddle
import paddle.fluid as fluid
import pgl
import time
from pgl.utils.logger import log
from pgl.utils import mp_reader
def batch_iter(data, batch_size):
"""batch_iter"""
src, dst, eid = data
perm = np.arange(len(eid))
np.random.shuffle(perm)
start = 0
while start < len(src):
index = perm[start:start + batch_size]
start += batch_size
yield src[index], dst[index], eid[index]
def traverse(item):
"""traverse"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids):
"""flat_node_and_edge"""
nodes = list(set(traverse(nodes)))
eids = list(set(traverse(eids)))
return nodes, eids
def graph_reader(num_layers,
graph_wrappers,
data,
batch_size,
samples,
num_workers,
feed_name_list,
use_pyreader=False,
graph=None,
predict=False):
"""graph_reader
"""
assert num_layers == len(samples), "Must be unified number of layers!"
if num_workers > 1:
return multiprocess_graph_reader(
num_layers,
graph_wrappers,
data,
batch_size,
samples,
num_workers,
feed_name_list,
use_pyreader,
graph=graph,
predict=predict)
batch_info = list(batch_iter(data, batch_size=batch_size))
work = worker(
num_layers,
batch_info,
graph_wrappers,
samples,
feed_name_list,
use_pyreader,
graph=graph,
predict=predict)
def reader():
"""reader"""
for batch in work():
yield batch
return reader
#return paddle.reader.buffered(reader, 100)
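# worker performs layer-wise predecessor sampling: starting from the batch
# edges' endpoints it repeatedly samples predecessors, accumulating the node
# and edge sets per layer. During training the positive edges (and their
# reversed copies) are removed from the innermost layer so the model cannot
# read the label off the graph structure.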
def worker(num_layers, batch_info, graph_wrappers, samples, feed_name_list,
use_pyreader, graph, predict):
"""worker
"""
pid = os.getppid()
np.random.seed((int(time.time() * 10000) + pid) % 65535)
graphs = [graph, graph]
def work():
"""work
"""
feed_dict = {}
ind = 0
perm = np.arange(0, len(batch_info))
np.random.shuffle(perm)
for p in perm:
batch_src, batch_dst, batch_eid = batch_info[p]
ind += 1
ind_start = time.time()
try:
nodes = start_nodes = np.concatenate([batch_src, batch_dst], 0)
eids = []
layer_nodes, layer_eids = [], []
for layer_idx in reversed(range(num_layers)):
if len(start_nodes) == 0:
layer_nodes = [nodes] + layer_nodes
layer_eids = [eids] + layer_eids
continue
pred_nodes, pred_eids = graphs[
layer_idx].sample_predecessor(
start_nodes, samples[layer_idx], return_eids=True)
last_nodes = nodes
nodes, eids = flat_node_and_edge([nodes, pred_nodes],
[eids, pred_eids])
layer_nodes = [nodes] + layer_nodes
layer_eids = [eids] + layer_eids
# Find new nodes
start_nodes = list(set(nodes) - set(last_nodes))
if predict is False:
eids = (batch_eid * 2 + 1).tolist() + (batch_eid * 2
).tolist()
layer_eids[0] = list(set(layer_eids[0]) - set(eids))
# layer_nodes[0]: use first layer nodes as all subgraphs' nodes
subgraph = graphs[0].subgraph(
nodes=layer_nodes[0], eid=layer_eids[0])
node_feat = np.array(layer_nodes[0], dtype="int64")
subgraph.node_feat["index"] = node_feat
except Exception as e:
print(e)
if len(feed_dict) > 0:
yield feed_dict
continue
feed_dict = graph_wrappers[0].to_feed(subgraph)
# only reindex from first subgraph
sub_src_idx = subgraph.reindex_from_parrent_nodes(batch_src)
sub_dst_idx = subgraph.reindex_from_parrent_nodes(batch_dst)
feed_dict["src_index"] = sub_src_idx.astype("int64")
feed_dict["dst_index"] = sub_dst_idx.astype("int64")
if predict:
feed_dict["node_id"] = batch_src.astype("int64")
if use_pyreader:
yield [feed_dict[name] for name in feed_name_list]
else:
yield feed_dict
return work
def multiprocess_graph_reader(num_layers, graph_wrappers, data, batch_size,
samples, num_workers, feed_name_list,
use_pyreader, graph, predict):
""" multiprocess_graph_reader
"""
def parse_to_subgraph(rd):
""" parse_to_subgraph
"""
def work():
""" work
"""
for data in rd():
yield data
return work
def reader():
""" reader
"""
batch_info = list(batch_iter(data, batch_size=batch_size))
        log.info("Number of batches: %d" % len(batch_info))
block_size = int(len(batch_info) / num_workers + 1)
reader_pool = []
for i in range(num_workers):
reader_pool.append(
worker(num_layers, batch_info[block_size * i:block_size * (
i + 1)], graph_wrappers, samples, feed_name_list,
use_pyreader, graph, predict))
use_pipe = True
multi_process_sample = mp_reader.multiprocess_reader(
reader_pool, use_pipe=use_pipe)
r = parse_to_subgraph(multi_process_sample)
if use_pipe:
return paddle.reader.buffered(r, 5 * num_workers)
else:
return r
return reader()
265 1599
979 1790
650 1488
638 1310
962 1916
239 1958
103 1763
918 1874
599 1924
47 1691
272 1978
550 1583
163 1142
561 1458
211 1447
188 1529
983 1039
68 1923
715 1900
657 1555
338 1937
379 1409
19 1978
224 1420
755 1499
618 1172
766 1294
401 1188
89 1257
149 1048
835 1526
358 1858
218 1187
227 1022
530 1643
197 1255
529 1672
960 1558
519 1176
433 1093
347 1495
572 1877
505 1047
988 1587
125 1249
555 1942
614 1586
836 1681
628 1076
28 1693
519 1398
133 1136
883 1493
158 1441
568 1928
723 1585
488 1331
719 1471
265 1113
174 1799
722 1226
744 1467
807 1075
839 1393
664 1380
689 1552
36 1864
211 1611
90 1444
819 1428
241 1551
746 1599
72 1098
712 1787
54 1575
677 1485
289 1007
289 1079
907 1144
7 1983
655 1272
638 1047
849 1957
492 1278
453 1304
657 1807
367 1002
141 1346
688 1450
984 1749
255 1240
156 1625
731 1051
211 1922
165 1805
765 1054
794 1555
709 1747
822 1099
805 1774
422 1240
728 1679
55 1299
314 1808
781 1689
558 1605
707 1110
510 1705
956 1064
568 1132
267 1257
868 1269
690 1453
858 1602
826 1373
338 1650
335 1453
458 1340
0 1818
729 1694
25 1816
679 1109
323 1609
614 1457
342 1028
436 1081
932 1139
190 1821
808 1623
717 1267
950 1265
177 1956
97 1380
500 1744
232 1582
119 1015
656 1462
730 1007
860 1142
771 1989
784 1623
976 1084
770 1642
527 1515
784 1943
527 1578
718 1396
942 1089
661 1705
787 1800
893 1932
849 1395
758 1482
424 1148
873 1470
896 1333
465 1021
137 1507
718 1027
7 1045
285 1932
371 1468
51 1692
249 1358
898 1858
688 1213
419 1289
328 1326
764 1786
142 1399
905 1738
976 1295
715 1537
994 1393
479 1291
165 1560
308 1446
691 1728
779 1162
320 1989
745 1579
586 1426
142 1517
45 1317
657 1339
191 1780
801 1216
124 1414
344 1717
682 1383
216 1891
24 1759
207 1080
707 1699
212 1606
902 1435
525 1174
349 1299
380 1840
265 1294
352 1390
439 1410
984 1481
423 1499
261 1484
70 1033
192 1909
36 1960
823 1109
132 1418
992 1257
126 1548
872 1488
287 1645
108 1836
990 1314
450 1119
132 1549
0 1003
748 1373
841 1475
75 1987
880 1458
447 1443
122 1385
209 1022
74 1724
355 1688
742 1892
900 1092
48 1220
525 1221
817 1010
957 1212
713 1558
504 1851
84 1860
695 1187
326 1524
33 1647
864 1637
905 1637
280 1617
47 1034
781 1137
792 1319
901 1850
183 1511
571 1725
111 1957
222 1030
794 1169
147 1973
588 1789
24 1581
597 1471
106 1786
432 1146
447 1325
521 1444
968 1417
13 1075
521 1478
853 1294
550 1550
673 1426
150 1684
369 1737
994 1038
601 1397
616 1400
958 1028
279 1177
920 1180
878 1584
661 1852
225 1631
793 1401
507 1289
177 1818
551 1836
473 1065
723 1383
337 1938
81 1601
62 1139
928 1853
122 1946
260 1289
541 1378
934 1069
52 1311
689 1420
307 1862
811 1691
636 1885
405 1883
337 1132
645 1261
969 1224
823 1106
727 1066
763 1126
54 1168
677 1750
699 1223
744 1183
343 1883
152 1440
534 1665
79 1853
272 1581
92 1309
756 1884
460 1305
595 1868
469 1904
552 1067
422 1318
673 1843
403 1174
224 1445
181 1566
389 1618
936 1479
80 1002
291 1611
776 1201
57 1495
397 1053
807 1810
763 1374
648 1054
869 1432
169 1083
891 1318
270 1200
833 1663
970 1653
363 1637
188 1192
116 1751
110 1035
204 1216
524 1995
914 1426
289 1814
357 1521
366 1808
176 1775
650 1959
775 1062
781 1712
396 1798
725 1577
864 1497
540 1188
321 1623
995 1622
719 1299
72 1656
348 1728
141 1547
722 1095
64 1689
747 1143
892 1758
381 1463
693 1199
89 1555
576 1313
253 1809
878 1466
954 1776
365 1366
716 1351
707 1441
325 1167
63 1385
430 1225
479 1159
13 1185
731 1653
373 1529
271 1904
631 1111
114 1758
502 1983
685 1261
719 1932
1 1646
738 1698
432 1294
197 1463
293 1626
434 1457
315 1481
552 1877
100 1103
294 1569
689 1377
84 1142
631 1935
87 1508
560 1358
5 1787
65 1877
114 1948
536 1435
223 1753
494 1230
139 1335
55 1306
481 1253
326 1662
7 1171
663 1992
353 1586
693 1397
70 1498
902 1897
729 1627
838 1296
9 1528
633 1988
216 1535
813 1534
528 1061
130 1705
889 1019
278 1810
937 1399
286 1498
166 1574
725 1506
202 1018
306 1420
553 1717
755 1731
561 1619
147 1981
862 1065
349 1219
573 1137
336 1871
473 1511
342 1051
983 1181
798 1663
197 1930
164 1477
954 1083
695 1879
964 1046
638 1817
404 1886
927 1211
554 1115
88 1417
345 1165
383 1551
412 1484
305 1532
57 1380
171 1550
15 1082
941 1507
199 1774
787 1953
125 1398
336 1958
640 1851
251 1127
740 1306
302 1217
786 1014
706 1811
835 1851
978 1262
629 1944
429 1202
714 1954
153 1381
103 1759
268 1286
346 1808
420 1343
947 1467
668 1857
833 1736
600 1008
137 1649
452 1985
480 1545
212 1182
150 1726
784 1217
362 1595
763 1365
68 1395
195 1041
92 1599
314 1397
971 1003
606 1914
711 1706
699 1056
119 1593
367 1476
725 1098
432 1234
684 1255
469 1606
440 1086
200 1848
294 1144
449 1888
376 1225
796 1352
767 1447
713 1845
223 1333
119 1797
752 1927
627 1464
279 1488
40 1562
62 1149
771 1058
600 1911
625 1164
366 1416
714 1530
513 1935
419 1485
963 1665
459 1648
977 1522
890 1521
931 1566
622 1838
158 1958
848 1520
357 1275
43 1440
404 1772
788 1930
841 1832
845 1281
516 1121
423 1130
86 1619
863 1928
195 1789
167 1944
589 1093
146 1206
74 1133
819 1445
678 1004
752 1725
366 1604
903 1738
882 1858
561 1195
436 1980
77 1894
353 1879
561 1166
989 1964
624 1013
572 1704
272 1077
509 1242
770 1001
279 1392
621 1924
542 1766
555 1951
577 1598
531 1148
806 1401
497 1115
872 1309
387 1880
430 1485
295 1175
400 1774
941 1522
336 1032
806 1873
576 1422
566 1974
241 1847
215 1645
670 1804
831 1834
734 1091
16 1641
952 1975
299 1587
442 1032
702 1341
570 1405
633 1651
444 1731
980 1774
381 1729
900 1661
875 1274
968 1095
894 1805
683 1961
130 1549
963 1350
817 1864
190 1281
91 1657
208 1194
621 1911
447 1338
538 1343
234 1534
765 1920
632 1263
96 1090
121 1659
47 1975
856 1354
601 1061
480 1236
808 1487
866 1999
861 1892
667 1124
425 1307
90 1002
725 1337
134 1749
272 1587
567 1276
43 1332
715 1084
967 1477
62 1731
244 1540
317 1112
893 1108
242 1443
688 1544
937 1475
761 1912
994 1219
827 1193
420 1966
109 1691
482 1767
564 1146
372 1215
954 1348
422 1045
987 1040
471 1247
919 1824
190 1615
874 1879
251 1198
611 1575
121 1733
596 1950
791 1492
504 1201
153 1680
719 1967
964 1095
889 1106
732 1770
967 1631
351 1061
912 1835
911 1925
501 1502
810 1406
948 1718
928 1080
384 1940
330 1301
143 1081
412 1649
686 1840
178 1544
266 1121
528 1714
296 1156
220 1753
726 1679
126 1416
364 1424
625 1539
721 1708
805 1639
384 1157
553 1693
570 1877
511 1984
774 1254
354 1949
823 1162
281 1204
657 1774
578 1943
902 1764
859 1063
543 1845
815 1052
430 1118
22 1210
477 1586
872 1692
478 1943
630 1850
928 1247
893 1126
757 1774
133 1275
740 1101
117 1200
931 1120
259 1184
16 1782
447 1131
637 1498
472 1859
760 1877
303 1511
903 1074
795 1227
398 1450
28 1339
428 1891
476 1680
934 1409
78 1737
467 1075
126 1830
0 1421
783 1357
584 1061
139 1166
122 1768
735 1219
202 1684
867 1405
619 1176
843 1833
553 1239
287 1080
373 1780
65 1816
227 1871
45 1701
38 1281
46 1077
911 1708
137 1478
20 1550
822 1631
831 1527
13 1001
509 1096
31 1751
196 1123
379 1614
777 1288
364 1222
478 1070
460 1580
986 1340
696 1498
679 1139
713 1343
91 1691
602 1696
377 1770
253 1021
957 1179
500 1423
487 1281
821 1652
180 1122
443 1247
583 1289
676 1258
781 1693
718 1500
832 1662
555 1029
575 1595
145 1801
471 1769
491 1388
269 1241
159 1428
631 1698
478 1268
925 1141
583 1096
759 1592
967 1352
862 1444
119 1991
534 1602
526 1226
880 1614
236 1615
448 1600
752 1041
25 1127
445 1853
414 1058
127 1913
512 1080
158 1522
787 1287
664 1744
914 1335
899 1630
187 1279
951 1942
884 1777
529 1937
395 1590
478 1066
790 1518
286 1614
640 1528
882 1707
102 1303
716 1794
919 1605
859 1759
236 1321
858 1608
732 1506
435 1263
93 1508
813 1260
640 1668
607 1185
402 1039
943 1569
523 1415
511 1786
637 1934
10 1885
507 1375
544 1988
709 1537
342 1717
324 1393
216 1090
788 1753
362 1308
64 1576
811 1726
555 1636
944 1715
259 1251
141 1888
48 1290
570 1331
957 1104
223 1233
494 1531
423 1433
151 1266
704 1002
694 1685
740 1001
174 1537
947 1359
49 1891
875 1386
274 1621
918 1610
631 1564
961 1960
702 1642
871 1489
384 1642
932 1559
886 1097
842 1143
950 1971
83 1986
944 1135
168 1923
900 1611
684 1389
540 1749
123 1265
673 1617
952 1921
767 1401
696 1941
868 1536
515 1953
438 1757
430 1411
661 1193
527 1882
147 1145
225 1101
710 1671
579 1255
30 1920
906 1298
333 1635
214 1127
362 1189
878 1530
808 1842
419 1559
861 1291
743 1043
333 1257
186 1604
141 1957
751 1236
573 1937
908 1460
627 1155
726 1885
332 1888
267 1040
28 1660
194 1200
971 1788
861 1122
582 1397
176 1091
397 1678
730 1307
309 1860
881 1255
701 1068
750 1103
755 1843
834 1786
900 1837
433 1601
897 1464
593 1661
451 1638
953 1101
122 1123
220 1792
35 1933
726 1751
715 1411
662 1307
197 1322
125 1658
478 1700
772 1881
547 1822
910 1280
924 1933
79 1740
466 1567
53 1768
500 1502
572 1048
751 1194
18 1187
374 1480
158 1135
712 1686
171 1466
25 1036
144 1847
664 1937
301 1129
641 1880
147 1709
885 1911
631 1910
338 1914
628 1257
909 1333
970 1790
971 1691
260 1724
693 1946
857 1056
918 1053
612 1838
479 1407
626 1359
273 1709
633 1008
364 1434
393 1873
294 1300
657 1988
355 1639
635 1468
914 1350
916 1148
305 1381
131 1748
756 1484
758 1203
825 1062
152 1209
441 1164
63 1885
864 1797
165 1036
124 1548
246 1053
810 1398
127 1091
277 1028
860 1069
700 1933
338 1962
211 1770
809 1483
489 1507
123 1382
669 1030
180 1996
972 1922
723 1670
647 1683
422 1440
391 1204
178 1071
421 1598
729 1466
339 1403
419 1326
407 1011
479 1867
722 1076
662 1802
110 1438
759 1868
22 1458
725 1648
958 1753
814 1656
673 1044
962 1020
475 1523
882 1513
802 1227
863 1121
772 1677
714 1072
112 1047
422 1664
419 1718
60 1864
570 1683
536 1673
581 1789
894 1074
739 1311
805 1863
861 1750
55 1748
47 1833
101 1108
872 1008
926 1907
909 1021
53 1233
617 1349
674 1909
507 1567
855 1723
690 1171
973 1859
686 1210
49 1435
146 1915
357 1620
208 1724
76 1583
133 1191
619 1426
190 1497
228 1868
365 1144
360 1770
329 1142
672 1408
91 1997
986 1299
654 1333
93 1475
146 1307
62 1772
502 1058
382 1427
181 1739
74 1104
170 1684
466 1861
147 1747
162 1027
499 1903
813 1621
591 1379
227 1518
110 1999
781 1791
415 1744
257 1846
942 1601
628 1696
317 1001
27 1681
80 1078
794 1279
330 1237
830 1994
728 1673
204 1943
295 1422
159 1499
207 1019
110 1497
439 1526
201 1323
620 1723
501 1157
305 1604
878 1784
483 1653
262 1539
21 1967
191 1836
199 1821
500 1910
232 1499
104 1750
868 1607
288 1013
434 1368
874 1055
870 1257
219 1143
990 1924
70 1764
207 1575
1 1364
405 1498
414 1507
65 1704
868 1415
256 1962
886 1425
834 1587
770 1842
74 1070
778 1750
550 1592
484 1948
669 1401
610 1909
480 1784
182 1147
842 1670
272 1923
371 1407
574 1985
978 1300
369 1286
884 1459
322 1261
456 1418
261 1718
330 1708
83 1249
473 1188
542 1281
551 1262
801 1288
372 1574
676 1927
44 1222
190 1020
284 1513
866 1845
828 1977
620 1854
288 1086
367 1606
71 1770
114 1316
571 1850
224 1272
406 1095
902 1571
576 1886
576 1562
767 1443
644 1201
295 1009
944 1751
90 1708
663 1042
283 1708
758 1027
851 1684
537 1204
271 1697
541 1885
973 1218
694 1904
822 1999
194 1872
276 1297
909 1886
312 1706
516 1473
844 1236
62 1617
366 1866
127 1474
743 1215
286 1096
87 1795
69 1711
757 1530
333 1844
257 1796
515 1491
66 1851
117 1510
18 1967
553 1979
267 1060
99 1321
861 1155
506 1067
944 1727
964 1171
329 1159
856 1018
858 1931
765 1617
951 1457
903 1184
241 1717
285 1533
320 1286
409 1400
924 1999
719 1501
14 1550
866 1246
86 1987
868 1551
620 1495
285 1918
810 1733
754 1871
755 1418
394 1528
839 1856
927 1964
321 1381
758 1337
635 1986
404 1038
854 1124
600 1507
342 1517
756 1567
498 1350
944 1048
481 1899
904 1335
412 1492
218 1021
636 1556
417 1354
116 1960
173 1267
525 1086
312 1389
973 1064
619 1103
987 1394
447 1188
862 1969
930 1485
419 1157
756 1787
860 1821
58 1662
353 1437
345 1290
753 1889
412 1688
37 1319
753 1201
136 1253
949 1592
459 1756
976 1522
450 1868
936 1384
393 1653
385 1936
704 1840
616 1709
786 1438
291 1830
848 1112
975 1595
967 1231
741 1672
160 1217
254 1634
530 1610
0 1445
170 1236
164 1316
127 1330
302 1627
953 1449
156 1583
784 1210
226 1551
397 1325
564 1825
42 1027
725 1612
114 1802
483 1384
684 1352
463 1908
978 1226
445 1217
800 1969
556 1274
49 1049
777 1808
732 1982
749 1590
574 1433
462 1515
637 1702
344 1224
489 1586
45 1242
755 1144
716 1293
319 1595
831 1657
154 1562
396 1814
657 1704
442 1405
898 1698
970 1287
967 1068
25 1761
211 1183
691 1905
466 1116
99 1521
834 1871
408 1809
8 1007
483 1336
485 1896
849 1467
192 1341
779 1801
678 1596
276 1051
709 1252
759 1656
27 1621
273 1911
697 1898
450 1995
688 1717
52 1966
920 1957
437 1549
533 1627
130 1315
392 1676
73 1886
650 1254
352 1079
165 1930
388 1236
426 1370
625 1648
457 1858
17 1109
926 1431
853 1530
90 1766
586 1275
894 1244
331 1469
447 1183
132 1167
230 1198
501 1240
440 1100
58 1665
85 1864
913 1448
738 1041
486 1012
162 1767
877 1060
10 1485
514 1807
224 1453
781 1340
311 1645
720 1837
259 1252
54 1174
788 1926
375 1440
23 1880
977 1632
389 1445
38 1508
517 1927
798 1598
483 1391
541 1788
46 1329
816 1758
158 1317
900 1577
369 1255
227 1795
37 1630
813 1565
965 1663
953 1963
503 1221
223 1064
161 1498
717 1855
527 1349
773 1813
522 1630
767 1275
582 1305
541 1563
79 1403
794 1544
74 1161
548 1543
18 1739
516 1516
697 1422
259 1840
195 1273
412 1222
571 1301
203 1914
420 1256
327 1277
894 1315
929 1302
773 1429
302 1309
488 1728
403 1256
549 1342
940 1764
524 1226
409 1076
233 1421
753 1667
664 1257
359 1079
291 1973
199 1373
654 1498
645 1074
481 1607
432 1852
692 1206
498 1726
586 1249
555 1338
107 1563
473 1300
51 1031
345 1236
757 1907
548 1088
680 1430
349 1468
435 1451
884 1301
683 1645
280 1388
84 1393
585 1561
86 1338
261 1972
941 1523
306 1697
718 1192
930 1121
726 1639
617 1399
939 1184
511 1084
832 1662
377 1881
371 1725
393 1653
415 1528
254 1572
927 1447
848 1355
797 1983
613 1417
127 1835
715 1471
974 1999
355 1178
675 1820
415 1601
593 1186
648 1907
922 1931
859 1828
110 1809
547 1809
944 1841
106 1446
635 1762
866 1431
199 1373
595 1454
991 1626
903 1720
989 1465
509 1506
168 1653
742 1892
644 1457
972 1046
87 1807
79 1596
24 1470
313 1732
772 1976
226 1146
835 1835
107 1057
430 1719
203 1810
643 1477
30 1918
889 1216
750 1501
180 1660
71 1463
966 1588
261 1858
829 1804
774 1379
342 1765
328 1943
296 1939
937 1444
628 1407
0 1977
233 1097
359 1438
910 1911
963 1026
942 1483
706 1997
682 1974
900 1513
298 1463
893 1855
322 1360
604 1122
948 1091
828 1158
682 1198
466 1781
661 1031
884 1744
891 1299
688 1266
89 1325
3 1026
299 1861
413 1062
775 1812
560 1926
799 1473
936 1445
537 1718
591 1680
202 1140
906 1163
977 1709
482 1904
345 1181
486 1502
445 1292
305 1328
87 1851
803 1197
94 1937
574 1546
643 1302
704 1633
536 1238
329 1663
737 1969
663 1278
335 1416
873 1390
705 1607
139 1436
740 1904
974 1321
338 1350
694 1456
779 1035
639 1238
603 1768
245 1363
390 1329
141 1680
483 1613
226 1632
820 1303
424 1655
54 1618
399 1297
130 1295
169 1996
78 1455
525 1409
741 1860
887 1664
347 1878
391 1343
66 1243
287 1876
35 1750
492 1261
789 1404
917 1041
937 1756
69 1239
218 1981
142 1382
882 1052
757 1290
178 1593
962 1504
781 1090
648 1912
207 1551
472 1372
937 1427
37 1270
511 1721
208 1491
299 1193
167 1718
781 1100
689 1177
732 1202
852 1665
556 1152
256 1908
261 1473
918 1941
755 1786
77 1062
208 1633
451 1502
181 1513
311 1571
240 1404
470 1720
913 1239
947 1553
706 1158
215 1968
912 1213
684 1117
560 1825
787 1083
764 1654
566 1252
238 1959
953 1954
985 1437
835 1434
88 1896
469 1447
655 1672
760 1631
919 1516
683 1698
811 1123
911 1961
302 1273
344 1399
89 1289
936 1236
395 1575
417 1981
10 1115
878 1839
213 1171
484 1475
460 1901
708 1299
320 1544
965 1375
451 1144
116 1959
143 1384
843 1051
368 1953
994 1141
704 1641
385 1729
240 1851
967 1306
719 1878
726 1439
550 1613
261 1660
550 1511
154 1782
12 1087
328 1120
618 1763
422 1667
519 1854
639 1719
942 1705
814 1893
576 1491
139 1499
422 1956
95 1082
676 1262
287 1965
60 1867
713 1444
435 1021
606 1042
86 1891
58 1035
311 1320
140 1463
82 1415
756 1991
505 1140
510 1982
701 1579
428 1787
388 1279
446 1709
222 1060
550 1363
798 1691
219 1181
137 1225
828 1955
721 1417
82 1675
854 1649
203 1355
352 1560
582 1633
118 1858
771 1304
321 1251
392 1206
958 1070
684 1713
939 1999
592 1726
56 1867
592 1988
736 1842
958 1559
989 1906
183 1749
462 1407
294 1890
771 1725
1 1897
49 1062
124 1558
575 1327
506 1243
154 1403
672 1573
423 1160
222 1950
67 1904
664 1802
585 1438
327 1353
284 1803
369 1251
291 1294
61 1509
551 1861
938 1061
765 1678
509 1323
145 1822
887 1975
768 1646
610 1140
690 1793
763 1262
96 1287
837 1876
632 1819
747 1141
71 1442
561 1709
290 1050
514 1106
87 1416
762 1666
83 1070
467 1271
7 1152
472 1509
861 1016
913 1109
934 1154
288 1197
175 1244
588 1960
316 1946
543 1882
359 1614
465 1779
892 1726
695 1531
542 1461
288 1190
966 1558
736 1064
997 1750
885 1427
888 1064
342 1553
77 1234
845 1636
407 1181
354 1114
670 1836
69 1065
12 1432
982 1944
837 1518
231 1274
2 1155
423 1136
377 1012
353 1203
257 1205
350 1753
479 1238
324 1619
705 1382
236 1249
695 1195
213 1906
231 1368
819 1392
509 1785
661 1546
210 1123
873 1301
363 1029
216 1998
240 1351
667 1195
515 1136
230 1779
385 1750
574 1432
435 1830
804 1902
249 1360
303 1158
969 1732
249 1526
159 1575
139 1833
347 1342
661 1731
887 1859
19 1001
748 1763
829 1878
828 1086
835 1791
895 1387
326 1003
568 1049
485 1750
760 1171
414 1394
987 1379
851 1857
8 1594
76 1655
363 1189
90 1630
976 1005
57 1457
886 1166
29 1658
543 1710
379 1142
499 1112
177 1843
746 1808
454 1523
676 1465
762 1980
309 1286
74 1330
359 1949
781 1590
874 1658
455 1770
790 1487
651 1249
855 1143
386 1439
298 1007
2 1028
217 1428
318 1191
968 1588
5 1329
625 1475
140 1718
401 1543
936 1260
311 1625
711 1886
832 1395
114 1259
782 1156
434 1891
539 1855
448 1748
199 1518
735 1380
908 1798
301 1759
876 1155
63 1637
739 1461
558 1305
533 1177
801 1914
97 1422
423 1377
920 1775
215 1512
691 1628
905 1824
540 1573
567 1285
573 1665
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""train.py
"""
import argparse
import time
import glob
import os
import numpy as np
import pgl
from pgl.utils.logger import log
from pgl.utils import paddle_helper
import paddle
import paddle.fluid as fluid
import tqdm
import reader
import model
def get_layer(layer_type, gw, feature, hidden_size, act, name, is_test=False):
"""get_layer"""
return getattr(model, layer_type)(gw, feature, hidden_size, act, name)
def load_pos_neg(data_path):
"""load_pos_neg"""
train_eid = []
train_src = []
train_dst = []
with open(data_path) as f:
eid = 0
for idx, line in tqdm.tqdm(enumerate(f)):
src, dst = line.strip().split('\t')
train_src.append(int(src))
train_dst.append(int(dst))
train_eid.append(int(eid))
eid += 1
    # concatenate the positive and negative samples
train_eid = np.array(train_eid, dtype="int64")
train_src = np.array(train_src, dtype="int64")
train_dst = np.array(train_dst, dtype="int64")
returns = {"train_data": (train_src, train_dst, train_eid), }
return returns
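# Illustrative example for binary_op (hypothetical 2-d embeddings): with
# u = [1.0, 2.0] and v = [3.0, 4.0],
#   "Average"     -> [2.0, 3.0]
#   "Hadamard"    -> [3.0, 8.0]
#   "Weighted-L1" -> [2.0, 2.0]
#   "Weighted-L2" -> [4.0, 4.0]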
def binary_op(u_embed, v_embed, binary_op_type):
"""binary_op"""
if binary_op_type == "Average":
edge_embed = (u_embed + v_embed) / 2
elif binary_op_type == "Hadamard":
edge_embed = u_embed * v_embed
elif binary_op_type == "Weighted-L1":
edge_embed = fluid.layers.abs(u_embed - v_embed)
elif binary_op_type == "Weighted-L2":
edge_embed = (u_embed - v_embed) * (u_embed - v_embed)
else:
        raise ValueError(binary_op_type + " binary_op_type doesn't exist")
return edge_embed
class RetDict(object):
"""RetDict"""
pass
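# build_graph_model wires up the GraphSAGE link-prediction network: a node
# embedding lookup indexed by the sampled subgraph, num_layers stacked
# aggregator layers, and two weight-tied fc heads (shared param name "feat")
# that project the gathered src/dst node representations before scoring.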
def build_graph_model(args):
"""build_graph_model"""
node_feature_info = [('index', [None], np.dtype('int64'))]
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
graph_wrappers = []
feed_list = []
graph_wrappers.append(
pgl.graph_wrapper.GraphWrapper(
"layer_0", fluid.CPUPlace(), node_feat=node_feature_info))
#edge_feat=[("f", [None, 1], "float32")]))
num_embed = args.num_nodes
num_layers = args.num_layers
src_index = fluid.layers.data(
"src_index", shape=[None], dtype="int64", append_batch_size=False)
dst_index = fluid.layers.data(
"dst_index", shape=[None], dtype="int64", append_batch_size=False)
feature = fluid.layers.embedding(
input=fluid.layers.reshape(graph_wrappers[0].node_feat['index'],
[-1, 1]),
size=(num_embed + 1, args.hidden_size),
is_sparse=args.is_sparse,
is_distributed=args.is_distributed)
features = [feature]
ret_dict = RetDict()
ret_dict.graph_wrappers = graph_wrappers
edge_data = [src_index, dst_index]
feed_list.extend(edge_data)
ret_dict.feed_list = feed_list
for i in range(num_layers):
if i == num_layers - 1:
act = None
else:
act = "leaky_relu"
feature = get_layer(
args.layer_type,
graph_wrappers[0],
feature,
args.hidden_size,
act,
name="%s_%s" % (args.layer_type, i))
features.append(feature)
src_feat = fluid.layers.gather(features[-1], src_index)
src_feat = fluid.layers.fc(src_feat,
args.hidden_size,
bias_attr=None,
param_attr=fluid.ParamAttr(name="feat"))
dst_feat = fluid.layers.gather(features[-1], dst_index)
dst_feat = fluid.layers.fc(dst_feat,
args.hidden_size,
bias_attr=None,
param_attr=fluid.ParamAttr(name="feat"))
if args.phase == "predict":
node_id = fluid.layers.data(
"node_id", shape=[None, 1], dtype="int64", append_batch_size=False)
ret_dict.src_feat = src_feat
ret_dict.dst_feat = dst_feat
ret_dict.id = node_id
return ret_dict
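    # In-batch negative sampling: cos_theta is a [batch_size, batch_size]
    # score matrix, the one-hot diagonal marks the true (src, dst) pairs and
    # every other dst in the batch acts as a negative. batch_loss_weight
    # re-balances the single positive against the batch_size - 1 negatives.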
batch_size = args.batch_size
batch_negative_label = fluid.layers.reshape(
fluid.layers.range(0, batch_size, 1, "int64"), [-1, 1])
batch_negative_label = fluid.layers.one_hot(batch_negative_label,
batch_size)
batch_loss_weight = (batch_negative_label *
(batch_size - 2) + 1.0) / (batch_size - 1)
batch_loss_weight.stop_gradient = True
batch_negative_label = fluid.layers.cast(
batch_negative_label, dtype="float32")
batch_negative_label.stop_gradient = True
cos_theta = fluid.layers.matmul(src_feat, dst_feat, transpose_y=True)
# Calc Loss
loss = fluid.layers.sigmoid_cross_entropy_with_logits(
x=cos_theta, label=batch_negative_label)
loss = loss * batch_loss_weight
#loss = fluid.layers.reduce_sum(loss, -1)
loss = fluid.layers.mean(loss)
# Calc AUC
proba = fluid.layers.sigmoid(cos_theta)
proba = fluid.layers.reshape(proba, [-1, 1])
proba = fluid.layers.concat([proba * -1 + 1, proba], axis=1)
gold_label = fluid.layers.reshape(batch_negative_label, [-1, 1])
gold_label = fluid.layers.cast(gold_label, "int64")
auc, batch_auc_out, [batch_stat_pos, batch_stat_neg, stat_pos, stat_neg] = \
fluid.layers.auc(input=proba, label=gold_label, curve='ROC', )
ret_dict.loss = loss
ret_dict.auc = batch_auc_out
return ret_dict
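# run_epoch drives training for one pass over the reader. Incomplete batches
# are skipped because the in-batch negative label matrix above is square
# (batch_size x batch_size); parameters are checkpointed every save_per_step
# batches and once more at the end of the epoch.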
def run_epoch(
py_reader,
exe,
program,
prefix,
model_dict,
epoch,
batch_size,
log_per_step=100,
save_per_step=10000, ):
"""run_epoch"""
batch = 0
start = time.time()
batch_end = time.time()
for batch_feed_dict in py_reader():
if prefix == "train":
if batch_feed_dict["src_index"].shape[0] != batch_size:
log.warning(
                    'batch_feed_dict["src_index"].shape[0] != batch_size, skip batch')
continue
batch_start = time.time()
batch += 1
batch_loss, batch_auc = exe.run(
program,
feed=batch_feed_dict,
fetch_list=[model_dict.loss.name, model_dict.auc.name])
batch_end = time.time()
if batch % log_per_step == 0:
log.info(
"Batch %s %s-Loss %s \t %s-Auc %s \t Speed(per batch) %.5lf sec"
% (batch, prefix, np.mean(batch_loss), prefix,
np.mean(batch_auc), batch_end - batch_start))
if batch != 0 and batch % save_per_step == 0:
fluid.io.save_params(
exe, dirname='checkpoint', main_program=program)
fluid.io.save_params(exe, dirname='checkpoint', main_program=program)
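# run_predict_epoch runs the network in inference mode, gathers the projected
# src features for every node id in the batch and writes the resulting
# [num_nodes, hidden_size] embedding table to emb.npy.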
def run_predict_epoch(py_reader,
exe,
program,
prefix,
model_dict,
num_nodes,
hidden_size,
log_per_step=100):
"""run_predict_epoch"""
batch = 0
start = time.time()
#use the parallel executor to speed up
batch_end = time.time()
all_feat = np.zeros((num_nodes, hidden_size), dtype="float32")
for batch_feed_dict in tqdm.tqdm(py_reader()):
batch_start = time.time()
batch += 1
batch_src_feat, batch_id = exe.run(
program,
feed=batch_feed_dict,
fetch_list=[model_dict.src_feat.name, model_dict.id.name])
for ind, id in enumerate(batch_id):
all_feat[id] = batch_src_feat[ind]
np.save("emb.npy", all_feat)
def main(args):
"""main"""
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
exe = fluid.Executor(place)
train_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(train_program, startup_program):
ret_dict = build_graph_model(args=args)
val_program = train_program.clone(for_test=True)
if args.phase == "train":
with fluid.program_guard(train_program, startup_program):
adam = fluid.optimizer.Adam(learning_rate=args.lr)
adam.minimize(ret_dict.loss)
# reset the place according to role of parameter server
exe.run(startup_program)
with open(args.data_path) as f:
log.info("Begin Load Graph")
src = []
dst = []
for idx, line in tqdm.tqdm(enumerate(f)):
s, d = line.strip().split()
src.append(s)
dst.append(d)
dst.append(s)
src.append(d)
src = np.array(src, dtype="int64").reshape(-1, 1)
dst = np.array(dst, dtype="int64").reshape(-1, 1)
edges = np.hstack([src, dst])
log.info("Begin Build Index")
ret_dict.graph = pgl.graph.Graph(num_nodes=args.num_nodes, edges=edges)
ret_dict.graph.indegree()
log.info("End Build Index")
if args.phase == "train":
#just the worker, load the sample
data = load_pos_neg(args.data_path)
feed_name_list = [var.name for var in ret_dict.feed_list]
train_iter = reader.graph_reader(
args.num_layers,
ret_dict.graph_wrappers,
batch_size=args.batch_size,
data=data['train_data'],
samples=args.samples,
num_workers=args.sample_workers,
feed_name_list=feed_name_list,
use_pyreader=args.use_pyreader,
graph=ret_dict.graph)
# get PyReader
for epoch in range(args.epoch):
epoch_start = time.time()
try:
run_epoch(
train_iter,
program=train_program,
exe=exe,
prefix="train",
model_dict=ret_dict,
epoch=epoch,
batch_size=args.batch_size,
log_per_step=1)
epoch_end = time.time()
                print("Epoch: {0}, Train total time: {1} ".format(
epoch, epoch_end - epoch_start))
except Exception as e:
log.info("Run Epoch Error %s" % e)
fluid.io.save_params(
exe,
dirname=args.checkpoint + '_%s' % (epoch + 1),
main_program=train_program)
log.info("EPOCH END")
log.info("RUN FINISH")
elif args.phase == "predict":
fluid.io.load_params(
exe,
dirname=args.checkpoint + '_%s' % args.epoch,
main_program=val_program)
test_src = np.arange(0, args.num_nodes, dtype="int64")
feed_name_list = [var.name for var in ret_dict.feed_list]
predict_iter = reader.graph_reader(
args.num_layers,
ret_dict.graph_wrappers,
batch_size=args.batch_size,
data=(test_src, test_src, test_src),
samples=args.samples,
num_workers=args.sample_workers,
feed_name_list=feed_name_list,
use_pyreader=args.use_pyreader,
graph=ret_dict.graph,
predict=True)
run_predict_epoch(
predict_iter,
program=val_program,
exe=exe,
prefix="predict",
hidden_size=args.hidden_size,
model_dict=ret_dict,
num_nodes=args.num_nodes,
log_per_step=100)
log.info("EPOCH END")
if __name__ == "__main__":
parser = argparse.ArgumentParser(description='graphsage')
parser.add_argument(
"--use_cuda", action='store_true', help="use_cuda", default=False)
parser.add_argument("--layer_type", type=str, default="graphsage_mean")
parser.add_argument("--epoch", type=int, default=1)
parser.add_argument("--hidden_size", type=int, default=128)
parser.add_argument("--batch_size", type=int, default=1024)
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--num_layers", type=int, default=2)
parser.add_argument("--data_path", type=str, required=True)
parser.add_argument("--checkpoint", type=str, default="model_ckpt")
parser.add_argument("--cache_path", type=str, default="./tmp")
parser.add_argument("--phase", type=str, default="train")
parser.add_argument("--digraph", action='store_true', default=False)
parser.add_argument('--samples', nargs='+', type=int, default=[10, 10])
parser.add_argument("--sample_workers", type=int, default=10)
parser.add_argument("--num_nodes", type=int, required=True)
parser.add_argument("--is_sparse", action='store_true', default=False)
parser.add_argument("--is_distributed", action='store_true', default=False)
parser.add_argument("--real_graph", action='store_true', default=True)
parser.add_argument("--use_pyreader", action='store_true', default=False)
args = parser.parse_args()
log.info(args)
main(args)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test ogb
"""
import argparse
import pgl
import numpy as np
import paddle.fluid as fluid
from pgl.contrib.ogb.graphproppred.dataset_pgl import PglGraphPropPredDataset
from pgl.utils import paddle_helper
from ogb.graphproppred import Evaluator
from pgl.contrib.ogb.graphproppred.mol_encoder import AtomEncoder, BondEncoder
def train(exe, batch_size, graph_wrapper, train_program, splitted_idx, dataset,
evaluator, fetch_loss, fetch_pred):
"""Train"""
graphs, labels = dataset[splitted_idx["train"]]
perm = np.arange(0, len(graphs))
np.random.shuffle(perm)
start_batch = 0
batch_no = 0
pred_output = np.zeros_like(labels, dtype="float32")
while start_batch < len(perm):
batch_index = perm[start_batch:start_batch + batch_size]
start_batch += batch_size
batch_graph = pgl.graph.MultiGraph(graphs[batch_index])
batch_label = labels[batch_index]
batch_valid = (batch_label == batch_label).astype("float32")
batch_label = np.nan_to_num(batch_label).astype("float32")
feed_dict = graph_wrapper.to_feed(batch_graph)
feed_dict["label"] = batch_label
feed_dict["weight"] = batch_valid
loss, pred = exe.run(train_program,
feed=feed_dict,
fetch_list=[fetch_loss, fetch_pred])
pred_output[batch_index] = pred
batch_no += 1
print("train", evaluator.eval({"y_true": labels, "y_pred": pred_output}))
def evaluate(exe, batch_size, graph_wrapper, val_program, splitted_idx,
dataset, mode, evaluator, fetch_pred):
"""Eval"""
graphs, labels = dataset[splitted_idx[mode]]
perm = np.arange(0, len(graphs))
start_batch = 0
batch_no = 0
pred_output = np.zeros_like(labels, dtype="float32")
while start_batch < len(perm):
batch_index = perm[start_batch:start_batch + batch_size]
start_batch += batch_size
batch_graph = pgl.graph.MultiGraph(graphs[batch_index])
feed_dict = graph_wrapper.to_feed(batch_graph)
pred = exe.run(val_program, feed=feed_dict, fetch_list=[fetch_pred])
pred_output[batch_index] = pred[0]
batch_no += 1
print(mode, evaluator.eval({"y_true": labels, "y_pred": pred_output}))
def send_func(src_feat, dst_feat, edge_feat):
"""Send"""
return src_feat["h"] + edge_feat["h"]
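# GNNModel for ogbg graph property prediction: atom and bond features are
# embedded with AtomEncoder/BondEncoder, each layer adds the summed
# (node + edge) messages back onto the node state (a residual update) and
# applies fc + relu, and the graph-level prediction is an fc layer over
# average-pooled node states.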
class GNNModel(object):
"""GNNModel"""
def __init__(self, name, emb_dim, num_task, num_layers):
self.num_task = num_task
self.emb_dim = emb_dim
self.num_layers = num_layers
self.name = name
self.atom_encoder = AtomEncoder(name=name, emb_dim=emb_dim)
self.bond_encoder = BondEncoder(name=name, emb_dim=emb_dim)
    def forward(self, graph):
        """forward"""
h_node = self.atom_encoder(graph.node_feat['feat'])
h_edge = self.bond_encoder(graph.edge_feat['feat'])
for layer in range(self.num_layers):
msg = graph.send(
send_func,
nfeat_list=[("h", h_node)],
efeat_list=[("h", h_edge)])
h_node = graph.recv(msg, 'sum') + h_node
h_node = fluid.layers.fc(h_node,
size=self.emb_dim,
name=self.name + '_%s' % layer,
act="relu")
graph_nodes = pgl.layers.graph_pooling(graph, h_node, "average")
graph_pred = fluid.layers.fc(graph_nodes, self.num_task, name="final")
return graph_pred
def main():
"""main
"""
# Training settings
parser = argparse.ArgumentParser(description='Graph Dataset')
parser.add_argument(
'--epochs',
type=int,
default=100,
help='number of epochs to train (default: 100)')
parser.add_argument(
'--dataset',
type=str,
default="ogbg-mol-tox21",
        help='dataset name (default: ogbg-mol-tox21)')
args = parser.parse_args()
place = fluid.CPUPlace() # Dataset too big to use GPU
### automatic dataloading and splitting
dataset = PglGraphPropPredDataset(name=args.dataset)
splitted_idx = dataset.get_idx_split()
### automatic evaluator. takes dataset name as input
evaluator = Evaluator(args.dataset)
graph_data, label = dataset[:2]
batch_graph = pgl.graph.MultiGraph(graph_data)
graph_data = batch_graph
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
# degree normalize
graph_data.edge_feat["feat"] = graph_data.edge_feat["feat"].astype("int64")
graph_data.node_feat["feat"] = graph_data.node_feat["feat"].astype("int64")
model = GNNModel(
name="gnn", num_task=dataset.num_tasks, emb_dim=64, num_layers=2)
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.GraphWrapper(
"graph",
place=place,
node_feat=graph_data.node_feat_info(),
edge_feat=graph_data.edge_feat_info())
pred = model.forward(gw)
sigmoid_pred = fluid.layers.sigmoid(pred)
val_program = train_program.clone(for_test=True)
initializer = []
with fluid.program_guard(train_program, startup_program):
train_label = fluid.layers.data(
name="label", dtype="float32", shape=[None, dataset.num_tasks])
train_weight = fluid.layers.data(
name="weight", dtype="float32", shape=[None, dataset.num_tasks])
train_loss_t = fluid.layers.sigmoid_cross_entropy_with_logits(
x=pred, label=train_label) * train_weight
train_loss_t = fluid.layers.reduce_sum(train_loss_t)
adam = fluid.optimizer.Adam(
learning_rate=1e-2,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(train_loss_t)
exe = fluid.Executor(place)
exe.run(startup_program)
for epoch in range(1, args.epochs + 1):
print("Epoch", epoch)
train(exe, 128, gw, train_program, splitted_idx, dataset, evaluator,
train_loss_t, sigmoid_pred)
evaluate(exe, 128, gw, val_program, splitted_idx, dataset, "valid",
evaluator, sigmoid_pred)
evaluate(exe, 128, gw, val_program, splitted_idx, dataset, "test",
evaluator, sigmoid_pred)
if __name__ == "__main__":
main()
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test ogb
"""
import argparse
import time
import logging
import numpy as np
import paddle.fluid as fluid
import pgl
from pgl.contrib.ogb.linkproppred.dataset_pgl import PglLinkPropPredDataset
from pgl.utils import paddle_helper
from ogb.linkproppred import Evaluator
def send_func(src_feat, dst_feat, edge_feat):
"""send_func"""
return src_feat["h"]
def recv_func(feat):
"""recv_func"""
return fluid.layers.sequence_pool(feat, pool_type="sum")
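# GNNModel for ogbl link prediction: a stack of GCN-style layers (sum
# aggregation followed by fc and a degree normalisation via
# node_feat["norm"]) over a learned node-embedding table. An edge (u, v) is
# scored by the element-wise product of its endpoint representations passed
# through a single fc layer.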
class GNNModel(object):
"""GNNModel"""
def __init__(self, name, num_nodes, emb_dim, num_layers):
self.num_nodes = num_nodes
self.emb_dim = emb_dim
self.num_layers = num_layers
self.name = name
self.src_nodes = fluid.layers.data(
name='src_nodes',
shape=[None],
dtype='int64', )
self.dst_nodes = fluid.layers.data(
name='dst_nodes',
shape=[None],
dtype='int64', )
self.edge_label = fluid.layers.data(
name='edge_label',
shape=[None, 1],
dtype='float32', )
def forward(self, graph):
"""forward"""
h = fluid.layers.create_parameter(
shape=[self.num_nodes, self.emb_dim],
dtype="float32",
name=self.name + "_embedding")
for layer in range(self.num_layers):
msg = graph.send(
send_func,
nfeat_list=[("h", h)], )
h = graph.recv(msg, recv_func)
h = fluid.layers.fc(
h,
size=self.emb_dim,
bias_attr=False,
param_attr=fluid.ParamAttr(name=self.name + '_%s' % layer))
h = h * graph.node_feat["norm"]
bias = fluid.layers.create_parameter(
shape=[self.emb_dim],
dtype='float32',
is_bias=True,
name=self.name + '_bias_%s' % layer)
h = fluid.layers.elementwise_add(h, bias, act="relu")
src = fluid.layers.gather(h, self.src_nodes, overwrite=False)
dst = fluid.layers.gather(h, self.dst_nodes, overwrite=False)
edge_embed = src * dst
pred = fluid.layers.fc(input=edge_embed,
size=1,
name=self.name + "_pred_output")
prob = fluid.layers.sigmoid(pred)
loss = fluid.layers.sigmoid_cross_entropy_with_logits(pred,
self.edge_label)
loss = fluid.layers.reduce_sum(loss)
return pred, prob, loss
def main():
"""main
"""
# Training settings
parser = argparse.ArgumentParser(description='Graph Dataset')
parser.add_argument(
'--epochs',
type=int,
default=4,
        help='number of epochs to train (default: 4)')
parser.add_argument(
'--dataset',
type=str,
default="ogbl-ppa",
        help='dataset name (default: ogbl-ppa, protein-protein association)')
parser.add_argument('--use_cuda', action='store_true')
parser.add_argument('--batch_size', type=int, default=5120)
parser.add_argument('--embed_dim', type=int, default=64)
parser.add_argument('--num_layers', type=int, default=2)
parser.add_argument('--lr', type=float, default=0.001)
args = parser.parse_args()
print(args)
place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
### automatic dataloading and splitting
    print("loading dataset")
dataset = PglLinkPropPredDataset(name=args.dataset)
splitted_edge = dataset.get_edge_split()
print(splitted_edge['train_edge'].shape)
print(splitted_edge['train_edge_label'].shape)
print("building evaluator")
### automatic evaluator. takes dataset name as input
evaluator = Evaluator(args.dataset)
graph_data = dataset[0]
print("num_nodes: %d" % graph_data.num_nodes)
train_program = fluid.Program()
startup_program = fluid.Program()
# degree normalize
indegree = graph_data.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
graph_data.node_feat["norm"] = np.expand_dims(norm, -1).astype("float32")
# graph_data.node_feat["index"] = np.array([i for i in range(graph_data.num_nodes)], dtype=np.int64).reshape(-1,1)
with fluid.program_guard(train_program, startup_program):
model = GNNModel(
name="gnn",
num_nodes=graph_data.num_nodes,
emb_dim=args.embed_dim,
num_layers=args.num_layers)
gw = pgl.graph_wrapper.GraphWrapper(
"graph",
place,
node_feat=graph_data.node_feat_info(),
edge_feat=graph_data.edge_feat_info())
pred, prob, loss = model.forward(gw)
val_program = train_program.clone(for_test=True)
with fluid.program_guard(train_program, startup_program):
global_steps = int(splitted_edge['train_edge'].shape[0] /
args.batch_size * 2)
learning_rate = fluid.layers.polynomial_decay(args.lr, global_steps,
0.00005)
adam = fluid.optimizer.Adam(
learning_rate=learning_rate,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(loss)
exe = fluid.Executor(place)
exe.run(startup_program)
feed = gw.to_feed(graph_data)
print("evaluate result before training: ")
result = test(exe, val_program, prob, evaluator, feed, splitted_edge)
print(result)
print("training")
cc = 0
for epoch in range(1, args.epochs + 1):
for batch_data, batch_label in data_generator(
graph_data,
splitted_edge["train_edge"],
splitted_edge["train_edge_label"],
batch_size=args.batch_size):
feed['src_nodes'] = batch_data[:, 0].reshape(-1, 1)
feed['dst_nodes'] = batch_data[:, 1].reshape(-1, 1)
feed['edge_label'] = batch_label.astype("float32")
res_loss, y_pred, b_lr = exe.run(
train_program,
feed=feed,
fetch_list=[loss, prob, learning_rate])
if cc % 1 == 0:
print("epoch %d | step %d | lr %s | Loss %s" %
(epoch, cc, b_lr[0], res_loss[0]))
cc += 1
if cc % 20 == 0:
print("Evaluating...")
result = test(exe, val_program, prob, evaluator, feed,
splitted_edge)
print("epoch %d | step %d" % (epoch, cc))
print(result)
def test(exe, val_program, prob, evaluator, feed, splitted_edge):
"""Evaluation"""
result = {}
feed['src_nodes'] = splitted_edge["valid_edge"][:, 0].reshape(-1, 1)
feed['dst_nodes'] = splitted_edge["valid_edge"][:, 1].reshape(-1, 1)
feed['edge_label'] = splitted_edge["valid_edge_label"].astype(
"float32").reshape(-1, 1)
y_pred = exe.run(val_program, feed=feed, fetch_list=[prob])[0]
input_dict = {
"y_pred_pos":
y_pred[splitted_edge["valid_edge_label"] == 1].reshape(-1, ),
"y_pred_neg":
y_pred[splitted_edge["valid_edge_label"] == 0].reshape(-1, )
}
result["valid"] = evaluator.eval(input_dict)
feed['src_nodes'] = splitted_edge["test_edge"][:, 0].reshape(-1, 1)
feed['dst_nodes'] = splitted_edge["test_edge"][:, 1].reshape(-1, 1)
feed['edge_label'] = splitted_edge["test_edge_label"].astype(
"float32").reshape(-1, 1)
y_pred = exe.run(val_program, feed=feed, fetch_list=[prob])[0]
input_dict = {
"y_pred_pos":
y_pred[splitted_edge["test_edge_label"] == 1].reshape(-1, ),
"y_pred_neg":
y_pred[splitted_edge["test_edge_label"] == 0].reshape(-1, )
}
result["test"] = evaluator.eval(input_dict)
return result
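# data_generator yields shuffled mini-batches of positive edges together with
# up to the same number of candidate negatives: for each positive source node
# a random destination is drawn from the batch's node pool, and candidates
# that correspond to real edges are filtered out with graph.has_edges_between.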
def data_generator(graph, data, label_data, batch_size, shuffle=True):
"""Data Generator"""
perm = np.arange(0, len(data))
if shuffle:
np.random.shuffle(perm)
offset = 0
while offset < len(perm):
batch_index = perm[offset:(offset + batch_size)]
offset += batch_size
pos_data = data[batch_index]
pos_label = label_data[batch_index]
neg_src_node = pos_data[:, 0]
neg_dst_node = np.random.choice(
pos_data.reshape(-1, ), size=len(neg_src_node))
neg_data = np.hstack(
[neg_src_node.reshape(-1, 1), neg_dst_node.reshape(-1, 1)])
exists = graph.has_edges_between(neg_src_node, neg_dst_node)
neg_data = neg_data[np.invert(exists)]
neg_label = np.zeros(shape=len(neg_data), dtype=np.int64)
batch_data = np.vstack([pos_data, neg_data])
label = np.vstack([pos_label.reshape(-1, 1), neg_label.reshape(-1, 1)])
yield batch_data, label
if __name__ == "__main__":
main()
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test ogb
"""
import argparse
import pgl
import numpy as np
import paddle.fluid as fluid
from pgl.contrib.ogb.nodeproppred.dataset_pgl import PglNodePropPredDataset
from pgl.utils import paddle_helper
from ogb.nodeproppred import Evaluator
def train():
pass
def send_func(src_feat, dst_feat, edge_feat):
return (src_feat["h"] + edge_feat["h"]) * src_feat["norm"]
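# GNNModel for ogbn node classification: node states start from a learned
# embedding of a dummy integer feature, edge features are projected with an
# fc layer, and each layer sends degree-normalised (node + edge) messages,
# sums them, and applies fc + bias + relu with a second normalisation on the
# receiving side.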
class GNNModel(object):
def __init__(self, name, emb_dim, num_task, num_layers):
self.num_task = num_task
self.emb_dim = emb_dim
self.num_layers = num_layers
self.name = name
def forward(self, graph):
h = fluid.layers.embedding(
graph.node_feat["x"],
size=(2, self.emb_dim)) # name=self.name + "_embedding")
edge_attr = fluid.layers.fc(graph.edge_feat["feat"], size=self.emb_dim)
for layer in range(self.num_layers):
msg = graph.send(
send_func,
nfeat_list=[("h", h), ("norm", graph.node_feat["norm"])],
efeat_list=[("h", edge_attr)])
h = graph.recv(msg, "sum")
h = fluid.layers.fc(
h,
size=self.emb_dim,
bias_attr=False,
param_attr=fluid.ParamAttr(name=self.name + '_%s' % layer))
h = h * graph.node_feat["norm"]
bias = fluid.layers.create_parameter(
shape=[self.emb_dim],
dtype='float32',
is_bias=True,
name=self.name + '_bias_%s' % layer)
h = fluid.layers.elementwise_add(h, bias, act="relu")
pred = fluid.layers.fc(h,
self.num_task,
act=None,
name=self.name + "_pred_output")
return pred
def main():
"""main
"""
# Training settings
parser = argparse.ArgumentParser(description='Graph Dataset')
parser.add_argument(
'--epochs',
type=int,
default=100,
help='number of epochs to train (default: 100)')
parser.add_argument(
'--dataset',
type=str,
default="ogbn-proteins",
        help='dataset name (default: ogbn-proteins)')
args = parser.parse_args()
#device = torch.device("cuda:" + str(args.device)) if torch.cuda.is_available() else torch.device("cpu")
#place = fluid.CUDAPlace(0)
place = fluid.CPUPlace() # Dataset too big to use GPU
### automatic dataloading and splitting
dataset = PglNodePropPredDataset(name=args.dataset)
splitted_idx = dataset.get_idx_split()
### automatic evaluator. takes dataset name as input
evaluator = Evaluator(args.dataset)
graph_data, label = dataset[0]
train_program = fluid.Program()
startup_program = fluid.Program()
test_program = fluid.Program()
# degree normalize
indegree = graph_data.indegree()
norm = np.zeros_like(indegree, dtype="float32")
norm[indegree > 0] = np.power(indegree[indegree > 0], -0.5)
graph_data.node_feat["norm"] = np.expand_dims(norm, -1).astype("float32")
graph_data.node_feat["x"] = np.zeros((len(indegree), 1), dtype="int64")
graph_data.edge_feat["feat"] = graph_data.edge_feat["feat"].astype(
"float32")
model = GNNModel(
name="gnn", num_task=dataset.num_tasks, emb_dim=64, num_layers=2)
with fluid.program_guard(train_program, startup_program):
gw = pgl.graph_wrapper.StaticGraphWrapper("graph", graph_data, place)
pred = model.forward(gw)
sigmoid_pred = fluid.layers.sigmoid(pred)
val_program = train_program.clone(for_test=True)
initializer = []
with fluid.program_guard(train_program, startup_program):
train_node_index, init = paddle_helper.constant(
"train_node_index", dtype="int64", value=splitted_idx["train"])
initializer.append(init)
train_node_label, init = paddle_helper.constant(
"train_node_label",
dtype="float32",
value=label[splitted_idx["train"]].astype("float32"))
initializer.append(init)
train_pred_t = fluid.layers.gather(pred, train_node_index)
train_loss_t = fluid.layers.sigmoid_cross_entropy_with_logits(
x=train_pred_t, label=train_node_label)
train_loss_t = fluid.layers.reduce_sum(train_loss_t)
train_pred_t = fluid.layers.sigmoid(train_pred_t)
adam = fluid.optimizer.Adam(
learning_rate=1e-2,
regularization=fluid.regularizer.L2DecayRegularizer(
regularization_coeff=0.0005))
adam.minimize(train_loss_t)
exe = fluid.Executor(place)
exe.run(startup_program)
gw.initialize(place)
for init in initializer:
init(place)
for epoch in range(1, args.epochs + 1):
loss = exe.run(train_program, feed={}, fetch_list=[train_loss_t])
print("Loss %s" % loss[0])
print("Evaluating...")
y_pred = exe.run(val_program, feed={}, fetch_list=[sigmoid_pred])[0]
result = {}
input_dict = {
"y_true": label[splitted_idx["train"]],
"y_pred": y_pred[splitted_idx["train"]]
}
result["train"] = evaluator.eval(input_dict)
input_dict = {
"y_true": label[splitted_idx["valid"]],
"y_pred": y_pred[splitted_idx["valid"]]
}
result["valid"] = evaluator.eval(input_dict)
input_dict = {
"y_true": label[splitted_idx["test"]],
"y_pred": y_pred[splitted_idx["test"]]
}
result["test"] = evaluator.eval(input_dict)
print(result)
if __name__ == "__main__":
main()
......@@ -13,8 +13,11 @@
# limitations under the License.
"""Generate pgl apis
"""
__version__ = "0.1.0.beta"
__version__ = "1.0.2"
from pgl import layers
from pgl import graph_wrapper
from pgl import graph
from pgl import data_loader
from pgl import heter_graph
from pgl import heter_graph_wrapper
from pgl import contrib
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""__init__.py"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PglGraphPropPredDataset
"""
import pandas as pd
import shutil, os
import os.path as osp
import numpy as np
from ogb.utils.url import decide_download, download_url, extract_zip
from ogb.graphproppred import make_master_file
from pgl.contrib.ogb.io.read_graph_pgl import read_csv_graph_pgl
def to_bool(value):
"""to_bool"""
return np.array([value], dtype="bool")[0]
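# PglGraphPropPredDataset downloads (if needed) and pre-processes an OGB
# graph-property-prediction dataset into PGL graphs, and exposes the official
# train/valid/test split via get_idx_split().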
class PglGraphPropPredDataset(object):
"""PglGraphPropPredDataset"""
def __init__(self, name, root="dataset"):
self.name = name ## original name, e.g., ogbg-mol-tox21
self.dir_name = "_".join(
name.split("-")
        ) + "_pgl"  ## replace hyphen with underscore, e.g., ogbg_mol_tox21_pgl
self.original_root = root
self.root = osp.join(root, self.dir_name)
self.meta_info = make_master_file.df #pd.read_csv(
#os.path.join(os.path.dirname(__file__), "master.csv"), index_col=0)
        if self.name not in self.meta_info:
print(self.name)
error_mssg = "Invalid dataset name {}.\n".format(self.name)
error_mssg += "Available datasets are as follows:\n"
error_mssg += "\n".join(self.meta_info.keys())
raise ValueError(error_mssg)
self.download_name = self.meta_info[self.name][
"download_name"] ## name of downloaded file, e.g., tox21
self.num_tasks = int(self.meta_info[self.name]["num tasks"])
self.task_type = self.meta_info[self.name]["task type"]
super(PglGraphPropPredDataset, self).__init__()
self.pre_process()
def pre_process(self):
"""Pre-processing"""
processed_dir = osp.join(self.root, 'processed')
raw_dir = osp.join(self.root, 'raw')
pre_processed_file_path = osp.join(processed_dir, 'pgl_data_processed')
if os.path.exists(pre_processed_file_path):
# TODO: Load Preprocessed
pass
else:
### download
url = self.meta_info[self.name]["url"]
if decide_download(url):
path = download_url(url, self.original_root)
extract_zip(path, self.original_root)
os.unlink(path)
# delete folder if there exists
try:
shutil.rmtree(self.root)
except:
pass
shutil.move(
osp.join(self.original_root, self.download_name),
self.root)
else:
print("Stop download.")
exit(-1)
### preprocess
add_inverse_edge = to_bool(self.meta_info[self.name][
"add_inverse_edge"])
self.graphs = read_csv_graph_pgl(
raw_dir, add_inverse_edge=add_inverse_edge)
self.graphs = np.array(self.graphs)
self.labels = np.array(
pd.read_csv(
osp.join(raw_dir, "graph-label.csv.gz"),
compression="gzip",
header=None).values)
# TODO: Load Graph
### load preprocessed files
def get_idx_split(self):
"""Train/Valid/Test split"""
split_type = self.meta_info[self.name]["split"]
path = osp.join(self.root, "split", split_type)
train_idx = pd.read_csv(
osp.join(path, "train.csv.gz"), compression="gzip",
header=None).values.T[0]
valid_idx = pd.read_csv(
osp.join(path, "valid.csv.gz"), compression="gzip",
header=None).values.T[0]
test_idx = pd.read_csv(
osp.join(path, "test.csv.gz"), compression="gzip",
header=None).values.T[0]
return {
"train": np.array(
train_idx, dtype="int64"),
"valid": np.array(
valid_idx, dtype="int64"),
"test": np.array(
test_idx, dtype="int64")
}
def __getitem__(self, idx):
"""Get datapoint with index"""
return self.graphs[idx], self.labels[idx]
def __len__(self):
"""Length of the dataset
Returns
-------
int
Length of Dataset
"""
return len(self.graphs)
def __repr__(self): # pragma: no cover
return '{}({})'.format(self.__class__.__name__, len(self))
if __name__ == "__main__":
pgl_dataset = PglGraphPropPredDataset(name="ogbg-mol-bace")
splitted_index = pgl_dataset.get_idx_split()
print(pgl_dataset)
print(pgl_dataset[3:20])
#print(pgl_dataset[splitted_index["train"]])
#print(pgl_dataset[splitted_index["valid"]])
#print(pgl_dataset[splitted_index["test"]])
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""MolEncoder for ogb
"""
import paddle.fluid as fluid
from ogb.utils.features import get_atom_feature_dims, get_bond_feature_dims
class AtomEncoder(object):
"""AtomEncoder for encoding node features"""
def __init__(self, name, emb_dim):
self.emb_dim = emb_dim
self.name = name
def __call__(self, x):
atom_feature = get_atom_feature_dims()
atom_input = fluid.layers.split(
x, num_or_sections=len(atom_feature), dim=-1)
outputs = None
count = 0
for _x, _atom_input_dim in zip(atom_input, atom_feature):
count += 1
emb = fluid.layers.embedding(
_x,
size=(_atom_input_dim, self.emb_dim),
param_attr=fluid.ParamAttr(
name=self.name + '_atom_feat_%s' % count))
if outputs is None:
outputs = emb
else:
outputs = outputs + emb
return outputs
class BondEncoder(object):
"""Bond for encoding edge features"""
def __init__(self, name, emb_dim):
self.emb_dim = emb_dim
self.name = name
def __call__(self, x):
bond_feature = get_bond_feature_dims()
bond_input = fluid.layers.split(
x, num_or_sections=len(bond_feature), dim=-1)
outputs = None
count = 0
for _x, _bond_input_dim in zip(bond_input, bond_feature):
count += 1
emb = fluid.layers.embedding(
_x,
size=(_bond_input_dim, self.emb_dim),
param_attr=fluid.ParamAttr(
name=self.name + '_bond_feat_%s' % count))
if outputs is None:
outputs = emb
else:
outputs = outputs + emb
return outputs
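# A hedged numpy illustration (not a fluid program) of what the encoders above
# compute: every categorical column gets its own embedding table and the
# per-column embeddings are summed into one vector per atom. The helper name,
# the random tables and `emb_dim` are illustrative assumptions.
def _numpy_atom_encoder_sketch(x, emb_dim=8, seed=0):
    """x: int array of shape [num_atoms, num_atom_columns] with raw categorical features."""
    import numpy as np
    rng = np.random.RandomState(seed)
    outputs = None
    for i, dim in enumerate(get_atom_feature_dims()):
        table = rng.randn(dim, emb_dim)  # stand-in for one learned embedding table
        emb = table[x[:, i]]             # per-column lookup
        outputs = emb if outputs is None else outputs + emb
    return outputs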
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""__init__.py
"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""pgl read_csv_graph for ogb
"""
import pandas as pd
import os.path as osp
import numpy as np
import pgl
from ogb.io.read_graph_raw import read_csv_graph_raw
def read_csv_graph_pgl(raw_dir, add_inverse_edge=False):
"""Read CSV data and build PGL Graph
"""
graph_list = read_csv_graph_raw(raw_dir, add_inverse_edge)
pgl_graph_list = []
for graph in graph_list:
edges = list(zip(graph["edge_index"][0], graph["edge_index"][1]))
g = pgl.graph.Graph(num_nodes=graph["num_nodes"], edges=edges)
if graph["edge_feat"] is not None:
g.edge_feat["feat"] = graph["edge_feat"]
if graph["node_feat"] is not None:
g.node_feat["feat"] = graph["node_feat"]
pgl_graph_list.append(g)
return pgl_graph_list
if __name__ == "__main__":
# graph_list = read_csv_graph_dgl('dataset/proteinfunc_v2/raw', add_inverse_edge = True)
graph_list = read_csv_graph_pgl(
'dataset/ogbn_proteins_pgl/raw', add_inverse_edge=True)
print(graph_list)
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""__init__.py
"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""LinkPropPredDataset for pgl
"""
import pandas as pd
import shutil, os
import os.path as osp
import numpy as np
from ogb.utils.url import decide_download, download_url, extract_zip
from ogb.linkproppred import make_master_file
from pgl.contrib.ogb.io.read_graph_pgl import read_csv_graph_pgl
def to_bool(value):
"""to_bool"""
return np.array([value], dtype="bool")[0]
class PglLinkPropPredDataset(object):
"""PglLinkPropPredDataset
"""
def __init__(self, name, root="dataset"):
self.name = name ## original name, e.g., ogbl-ppa
self.dir_name = "_".join(name.split(
"-")) + "_pgl" ## replace hyphen with underline, e.g., ogbl_ppa_pgl
self.original_root = root
self.root = osp.join(root, self.dir_name)
self.meta_info = make_master_file.df #pd.read_csv(os.path.join(os.path.dirname(__file__), "master.csv"), index_col=0)
        if self.name not in self.meta_info:
print(self.name)
error_mssg = "Invalid dataset name {}.\n".format(self.name)
error_mssg += "Available datasets are as follows:\n"
error_mssg += "\n".join(self.meta_info.keys())
raise ValueError(error_mssg)
self.download_name = self.meta_info[self.name][
"download_name"] ## name of downloaded file, e.g., ppassoc
self.task_type = self.meta_info[self.name]["task type"]
super(PglLinkPropPredDataset, self).__init__()
self.pre_process()
def pre_process(self):
"""pre_process downlaoding data
"""
processed_dir = osp.join(self.root, 'processed')
pre_processed_file_path = osp.join(processed_dir, 'pgl_data_processed')
if osp.exists(pre_processed_file_path):
#TODO: Reload Preprocess files
pass
else:
### check download
if not osp.exists(osp.join(self.root, "raw", "edge.csv.gz")):
url = self.meta_info[self.name]["url"]
if decide_download(url):
path = download_url(url, self.original_root)
extract_zip(path, self.original_root)
os.unlink(path)
                    # delete the folder if it already exists
try:
shutil.rmtree(self.root)
except:
pass
shutil.move(
osp.join(self.original_root, self.download_name),
self.root)
else:
print("Stop download.")
exit(-1)
raw_dir = osp.join(self.root, "raw")
### pre-process and save
add_inverse_edge = to_bool(self.meta_info[self.name][
"add_inverse_edge"])
self.graph = read_csv_graph_pgl(
raw_dir, add_inverse_edge=add_inverse_edge)
#TODO: SAVE preprocess graph
def get_edge_split(self):
"""Train/Validation/Test split
"""
split_type = self.meta_info[self.name]["split"]
path = osp.join(self.root, "split", split_type)
train_idx = pd.read_csv(
osp.join(path, "train.csv.gz"), compression="gzip",
header=None).values
valid_idx = pd.read_csv(
osp.join(path, "valid.csv.gz"), compression="gzip",
header=None).values
test_idx = pd.read_csv(
osp.join(path, "test.csv.gz"), compression="gzip",
header=None).values
if self.task_type == "link prediction":
target_type = np.int64
else:
target_type = np.float32
return {
"train_edge": np.array(
train_idx[:, :2], dtype="int64"),
"train_edge_label": np.array(
train_idx[:, 2], dtype=target_type),
"valid_edge": np.array(
valid_idx[:, :2], dtype="int64"),
"valid_edge_label": np.array(
valid_idx[:, 2], dtype=target_type),
"test_edge": np.array(
test_idx[:, :2], dtype="int64"),
"test_edge_label": np.array(
test_idx[:, 2], dtype=target_type)
}
def __getitem__(self, idx):
assert idx == 0, "This dataset has only one graph"
return self.graph[0]
def __len__(self):
return 1
def __repr__(self): # pragma: no cover
return '{}({})'.format(self.__class__.__name__, len(self))
if __name__ == "__main__":
pgl_dataset = PglLinkPropPredDataset(name="ogbl-ppa")
splitted_edge = pgl_dataset.get_edge_split()
print(pgl_dataset[0])
print(splitted_edge)
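# Hedged follow-up sketch (the helper name is illustrative): the arrays
# returned by get_edge_split and how their dtypes follow the task type
# resolved above.
def _example_edge_split(dataset):
    split = dataset.get_edge_split()
    train_edge = split["train_edge"]          # int64 array of shape [num_train, 2]
    train_label = split["train_edge_label"]   # int64 for link prediction, float32 otherwise
    return train_edge.shape, train_label.dtype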
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""__init__.py
"""
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""NodePropPredDataset for pgl
"""
import pandas as pd
import shutil, os
import os.path as osp
import numpy as np
from ogb.utils.url import decide_download, download_url, extract_zip
from ogb.nodeproppred import make_master_file # create master.csv
from pgl.contrib.ogb.io.read_graph_pgl import read_csv_graph_pgl
def to_bool(value):
"""to_bool"""
return np.array([value], dtype="bool")[0]
class PglNodePropPredDataset(object):
"""PglNodePropPredDataset
"""
def __init__(self, name, root="dataset"):
self.name = name ## original name, e.g., ogbn-proteins
self.dir_name = "_".join(
name.split("-")
) + "_pgl" ## replace hyphen with underline, e.g., ogbn_proteins_pgl
self.original_root = root
self.root = osp.join(root, self.dir_name)
self.meta_info = make_master_file.df #pd.read_csv(
#os.path.join(os.path.dirname(__file__), "master.csv"), index_col=0)
        if self.name not in self.meta_info:
error_mssg = "Invalid dataset name {}.\n".format(self.name)
error_mssg += "Available datasets are as follows:\n"
error_mssg += "\n".join(self.meta_info.keys())
raise ValueError(error_mssg)
self.download_name = self.meta_info[self.name][
"download_name"] ## name of downloaded file, e.g., tox21
self.num_tasks = int(self.meta_info[self.name]["num tasks"])
self.task_type = self.meta_info[self.name]["task type"]
super(PglNodePropPredDataset, self).__init__()
self.pre_process()
def pre_process(self):
"""pre_process downlaoding data
"""
processed_dir = osp.join(self.root, 'processed')
pre_processed_file_path = osp.join(processed_dir, 'pgl_data_processed')
if osp.exists(pre_processed_file_path):
# TODO: Reload Preprocess files
pass
else:
### check download
if not osp.exists(osp.join(self.root, "raw", "edge.csv.gz")):
url = self.meta_info[self.name]["url"]
if decide_download(url):
path = download_url(url, self.original_root)
extract_zip(path, self.original_root)
os.unlink(path)
                    # delete the folder if it already exists
try:
shutil.rmtree(self.root)
except:
pass
shutil.move(
osp.join(self.original_root, self.download_name),
self.root)
else:
print("Stop download.")
exit(-1)
raw_dir = osp.join(self.root, "raw")
### pre-process and save
add_inverse_edge = to_bool(self.meta_info[self.name][
"add_inverse_edge"])
self.graph = read_csv_graph_pgl(
raw_dir, add_inverse_edge=add_inverse_edge)
### adding prediction target
node_label = pd.read_csv(
osp.join(raw_dir, 'node-label.csv.gz'),
compression="gzip",
header=None).values
if "classification" in self.task_type:
node_label = np.array(node_label, dtype=np.int64)
else:
node_label = np.array(node_label, dtype=np.float32)
label_dict = {"labels": node_label}
# TODO: SAVE preprocess graph
self.labels = label_dict['labels']
def get_idx_split(self):
"""Train/Validation/Test split
"""
split_type = self.meta_info[self.name]["split"]
path = osp.join(self.root, "split", split_type)
train_idx = pd.read_csv(
osp.join(path, "train.csv.gz"), compression="gzip",
header=None).values.T[0]
valid_idx = pd.read_csv(
osp.join(path, "valid.csv.gz"), compression="gzip",
header=None).values.T[0]
test_idx = pd.read_csv(
osp.join(path, "test.csv.gz"), compression="gzip",
header=None).values.T[0]
return {
"train": np.array(
train_idx, dtype="int64"),
"valid": np.array(
valid_idx, dtype="int64"),
"test": np.array(
test_idx, dtype="int64")
}
def __getitem__(self, idx):
assert idx == 0, "This dataset has only one graph"
return self.graph[idx], self.labels
def __len__(self):
return 1
def __repr__(self): # pragma: no cover
return '{}({})'.format(self.__class__.__name__, len(self))
if __name__ == "__main__":
pgl_dataset = PglNodePropPredDataset(name="ogbn-proteins")
splitted_index = pgl_dataset.get_idx_split()
print(pgl_dataset[0])
print(splitted_index)
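# Hedged follow-up sketch (the helper name is illustrative): slice the node
# labels of the single graph by the split indices returned above.
def _example_split_node_labels(dataset):
    graph, labels = dataset[0]
    split = dataset.get_idx_split()
    train_labels = labels[split["train"]]
    return graph.num_nodes, train_labels.shape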
......@@ -20,7 +20,6 @@ import io
import sys
import numpy as np
import pickle as pkl
import networkx as nx
from pgl import graph
from pgl.utils.logger import log
......@@ -91,6 +90,7 @@ class CitationDataset(object):
def _load_data(self):
"""Load data
"""
import networkx as nx
objnames = ['x', 'y', 'tx', 'ty', 'allx', 'ally', 'graph']
objects = []
for i in range(len(objnames)):
......@@ -98,7 +98,7 @@ class CitationDataset(object):
'rb') as f:
objects.append(_pickle_load(f))
x, y, tx, ty, allx, ally, _graph = tuple(objects)
x, y, tx, ty, allx, ally, _graph = objects
test_idx_reorder = _parse_index_file("{}/ind.{}.test.index".format(
self.path, self.name))
test_idx_range = np.sort(test_idx_reorder)
......
......@@ -15,12 +15,14 @@
This package implements the graph structure for handling graph data.
"""
import os
import numpy as np
import pickle as pkl
import time
import pgl.graph_kernel as graph_kernel
from collections import defaultdict
__all__ = ['Graph', 'SubGraph']
__all__ = ['Graph', 'SubGraph', 'MultiGraph']
def _hide_num_nodes(shape):
......@@ -43,8 +45,8 @@ class EdgeIndex(object):
"""
def __init__(self, u, v, num_nodes):
self._v, self._eid, self._degree, self._sorted_u,\
self._sorted_v, self._sorted_eid = graph_kernel.build_index(u, v, num_nodes)
self._degree, self._sorted_v, self._sorted_u, \
self._sorted_eid, self._indptr = graph_kernel.build_index(u, v, num_nodes)
@property
def degree(self):
......@@ -52,23 +54,40 @@ class EdgeIndex(object):
"""
return self._degree
@property
def v(self):
"""Return the compressed v.
def view_v(self, u=None):
"""Return the compressed v for given u.
"""
return self._v
if u is None:
return np.split(self._sorted_v, self._indptr[1:])
else:
u = np.array(u, dtype="int64")
return graph_kernel.slice_by_index(
self._sorted_v, self._indptr, index=u)
@property
def eid(self):
"""Return the edge id.
def view_eid(self, u=None):
"""Return the compressed edge id for given u.
"""
return self._eid
if u is None:
return np.split(self._sorted_eid, self._indptr[1:])
else:
u = np.array(u, dtype="int64")
return graph_kernel.slice_by_index(
self._sorted_eid, self._indptr, index=u)
def triples(self):
"""Return the sorted (u, v, eid) tuples.
"""
return self._sorted_u, self._sorted_v, self._sorted_eid
def dump(self, path):
if not os.path.exists(path):
os.makedirs(path)
np.save(os.path.join(path, 'degree.npy'), self._degree)
np.save(os.path.join(path, 'sorted_u.npy'), self._sorted_u)
np.save(os.path.join(path, 'sorted_v.npy'), self._sorted_v)
np.save(os.path.join(path, 'sorted_eid.npy'), self._sorted_eid)
np.save(os.path.join(path, 'indptr.npy'), self._indptr)
class Graph(object):
"""Implementation of graph structure in pgl.
......@@ -114,25 +133,76 @@ class Graph(object):
self._edge_feat = {}
if isinstance(edges, np.ndarray):
if edges.dtype != "int32":
edges = edges.astype("int32")
if edges.dtype != "int64":
edges = edges.astype("int64")
else:
edges = np.array(edges, dtype="int32")
edges = np.array(edges, dtype="int64")
self._edges = edges
self._num_nodes = num_nodes
if len(edges) == 0:
            # check empty edges
src, dst = np.array([], dtype="int32"), np.array([], dtype="int32")
else:
src = edges[:, 0]
dst = edges[:, 1]
self._adj_src_index = None
self._adj_dst_index = None
self.indegree()
self._num_graph = 1
self._graph_lod = np.array([0, self.num_nodes], dtype="int32")
def dump(self, path):
if not os.path.exists(path):
os.makedirs(path)
np.save(os.path.join(path, 'num_nodes.npy'), self._num_nodes)
np.save(os.path.join(path, 'edges.npy'), self._edges)
if self._adj_src_index:
self._adj_src_index.dump(os.path.join(path, 'adj_src'))
if self._adj_dst_index:
self._adj_dst_index.dump(os.path.join(path, 'adj_dst'))
def dump_feat(feat_path, feat):
"""Dump all features to .npy file.
"""
if len(feat) == 0:
return
if not os.path.exists(feat_path):
os.makedirs(feat_path)
for key in feat:
np.save(os.path.join(feat_path, key + ".npy"), feat[key])
dump_feat(os.path.join(path, "node_feat"), self.node_feat)
dump_feat(os.path.join(path, "edge_feat"), self.edge_feat)
@property
def adj_src_index(self):
"""Return an EdgeIndex object for src.
"""
if self._adj_src_index is None:
if len(self._edges) == 0:
u = np.array([], dtype="int64")
v = np.array([], dtype="int64")
else:
u = self._edges[:, 0]
v = self._edges[:, 1]
self._adj_src_index = EdgeIndex(
u=u, v=v, num_nodes=self._num_nodes)
return self._adj_src_index
@property
def adj_dst_index(self):
"""Return an EdgeIndex object for dst.
"""
if self._adj_dst_index is None:
if len(self._edges) == 0:
v = np.array([], dtype="int64")
u = np.array([], dtype="int64")
else:
v = self._edges[:, 0]
u = self._edges[:, 1]
self._adj_src_index = EdgeIndex(
u=src, v=dst, num_nodes=self._num_nodes)
self._adj_dst_index = EdgeIndex(
u=dst, v=src, num_nodes=self._num_nodes)
self._adj_dst_index = EdgeIndex(
u=u, v=v, num_nodes=self._num_nodes)
return self._adj_dst_index
@property
def edge_feat(self):
......@@ -180,16 +250,16 @@ class Graph(object):
if sort_by not in ["src", "dst"]:
raise ValueError("sort_by should be in 'src' or 'dst'.")
if sort_by == 'src':
src, dst, eid = self._adj_src_index.triples()
src, dst, eid = self.adj_src_index.triples()
else:
dst, src, eid = self._adj_dst_index.triples()
dst, src, eid = self.adj_dst_index.triples()
return src, dst, eid
@property
def nodes(self):
"""Return all nodes id from 0 to :code:`num_nodes - 1`
"""
return np.arange(self._num_nodes, dtype="int32")
return np.arange(self._num_nodes, dtype="int64")
def indegree(self, nodes=None):
"""Return the indegree of the given nodes
......@@ -204,9 +274,9 @@ class Graph(object):
A numpy.ndarray as the given nodes' indegree.
"""
if nodes is None:
return self._adj_dst_index.degree
return self.adj_dst_index.degree
else:
return self._adj_dst_index.degree[nodes]
return self.adj_dst_index.degree[nodes]
def outdegree(self, nodes=None):
"""Return the outdegree of the given nodes.
......@@ -221,9 +291,9 @@ class Graph(object):
A numpy.array as the given nodes' outdegree.
"""
if nodes is None:
return self._adj_src_index.degree
return self.adj_src_index.degree
else:
return self._adj_src_index.degree[nodes]
return self.adj_src_index.degree[nodes]
def successor(self, nodes=None, return_eids=False):
"""Find successor of given nodes.
......@@ -271,19 +341,17 @@ class Graph(object):
[]]
"""
if nodes is None:
if return_eids:
return self._adj_src_index.v, self._adj_src_index.eid
else:
return self._adj_src_index.v
if return_eids:
return self.adj_src_index.view_v(
nodes), self.adj_src_index.view_eid(nodes)
else:
if return_eids:
return self._adj_src_index.v[nodes], self._adj_src_index.eid[
nodes]
else:
return self._adj_src_index.v[nodes]
return self.adj_src_index.view_v(nodes)
def sample_successor(self, nodes, max_degree, return_eids=False):
def sample_successor(self,
nodes,
max_degree,
return_eids=False,
shuffle=False):
"""Sample successors of given nodes.
Args:
......@@ -304,26 +372,20 @@ class Graph(object):
node_succ = self.successor(nodes, return_eids=return_eids)
if return_eids:
node_succ, node_succ_eid = node_succ
if nodes is None:
nodes = self.nodes
sample_succ, sample_succ_eid = [], []
for i in range(len(nodes)):
max_size = min(max_degree, len(node_succ[i]))
if max_size == 0:
sample_succ.append([])
if return_eids:
sample_succ_eid.append([])
else:
ind = np.random.choice(
len(node_succ[i]), max_size, replace=False)
sample_succ.append(node_succ[i][ind])
if return_eids:
sample_succ_eid.append(node_succ_eid[i][ind])
node_succ = node_succ.tolist()
if return_eids:
node_succ_eid = node_succ_eid.tolist()
if return_eids:
return sample_succ, sample_succ_eid
return graph_kernel.sample_subset_with_eid(
node_succ, node_succ_eid, max_degree, shuffle)
else:
return sample_succ
return graph_kernel.sample_subset(node_succ, max_degree, shuffle)
def predecessor(self, nodes=None, return_eids=False):
"""Find predecessor of given nodes.
......@@ -371,19 +433,17 @@ class Graph(object):
[2]]
"""
if nodes is None:
if return_eids:
return self._adj_dst_index.v, self._adj_dst_index.eid
else:
return self._adj_dst_index.v
if return_eids:
return self.adj_dst_index.view_v(
nodes), self.adj_dst_index.view_eid(nodes)
else:
if return_eids:
return self._adj_dst_index.v[nodes], self._adj_dst_index.eid[
nodes]
else:
return self._adj_dst_index.v[nodes]
return self.adj_dst_index.view_v(nodes)
def sample_predecessor(self, nodes, max_degree, return_eids=False):
def sample_predecessor(self,
nodes,
max_degree,
return_eids=False,
shuffle=False):
"""Sample predecessor of given nodes.
Args:
......@@ -407,24 +467,16 @@ class Graph(object):
if nodes is None:
nodes = self.nodes
sample_pred, sample_pred_eid = [], []
for i in range(len(nodes)):
max_size = min(max_degree, len(node_pred[i]))
if max_size == 0:
sample_pred.append([])
if return_eids:
sample_pred_eid.append([])
else:
ind = np.random.choice(
len(node_pred[i]), max_size, replace=False)
sample_pred.append(node_pred[i][ind])
if return_eids:
sample_pred_eid.append(node_pred_eid[i][ind])
node_pred = node_pred.tolist()
if return_eids:
node_pred_eid = node_pred_eid.tolist()
if return_eids:
return sample_pred, sample_pred_eid
return graph_kernel.sample_subset_with_eid(
node_pred, node_pred_eid, max_degree, shuffle)
else:
return sample_pred
return graph_kernel.sample_subset(node_pred, max_degree, shuffle)
def node_feat_info(self):
"""Return the information of node feature for GraphWrapper.
......@@ -500,19 +552,31 @@ class Graph(object):
(key, _hide_num_nodes(value.shape), value.dtype))
return edge_feat_info
def subgraph(self, nodes, eid):
def subgraph(self,
nodes,
eid=None,
edges=None,
edge_feats=None,
with_node_feat=True,
with_edge_feat=True):
"""Generate subgraph with nodes and edge ids.
This function will generate a :code:`pgl.graph.Subgraph` object and
        copy all corresponding node and edge features. Nodes and edges will
        be re-indexed from 0.
        be re-indexed from 0. ``eid`` and ``edges`` cannot both be None.
        WARNING: all nodes referenced by ``eid`` (or ``edges``) must be included in ``nodes``.
Args:
nodes: Node ids which will be included in the subgraph.
eid: Edge ids which will be included in the subgraph.
eid (optional): Edge ids which will be included in the subgraph.
edges (optional): Edge(src, dst) list which will be included in the subgraph.
with_node_feat: Whether to inherit node features from parent graph.
with_edge_feat: Whether to inherit edge features from parent graph.
Return:
A :code:`pgl.graph.Subgraph` object.
......@@ -522,16 +586,33 @@ class Graph(object):
for ind, node in enumerate(nodes):
reindex[node] = ind
eid = np.array(eid, dtype="int32")
sub_edges = graph_kernel.map_edges(eid, self._edges, reindex)
if eid is None and edges is None:
raise ValueError("Eid and edges can't be None at the same time.")
if edges is None:
edges = self._edges[eid]
else:
edges = np.array(edges, dtype="int64")
sub_edges = graph_kernel.map_edges(
np.arange(
len(edges), dtype="int64"), edges, reindex)
sub_edge_feat = {}
for key, value in self._edge_feat.items():
sub_edge_feat[key] = value[eid]
if edges is None:
if with_edge_feat:
for key, value in self._edge_feat.items():
if eid is None:
raise ValueError(
"Eid can not be None with edge features.")
sub_edge_feat[key] = value[eid]
else:
sub_edge_feat = edge_feats
sub_node_feat = {}
for key, value in self._node_feat.items():
sub_node_feat[key] = value[nodes]
if with_node_feat:
for key, value in self._node_feat.items():
sub_node_feat[key] = value[nodes]
subgraph = SubGraph(
num_nodes=len(nodes),
......@@ -554,7 +635,7 @@ class Graph(object):
Return:
Batch iterator
"""
perm = np.arange(self._num_nodes, dtype="int32")
perm = np.arange(self._num_nodes, dtype="int64")
if shuffle:
np.random.shuffle(perm)
start = 0
......@@ -644,7 +725,7 @@ class Graph(object):
break
succ = self.successor(cur_nodes)
sample_index = np.floor(
np.random.rand(outdegree.shape[0]) * outdegree).astype("int32")
np.random.rand(outdegree.shape[0]) * outdegree).astype("int64")
nxt_cur_nodes = []
for s, ind, walk_id in zip(succ, sample_index, cur_walk_ids):
......@@ -677,8 +758,8 @@ class Graph(object):
cur_walk_ids = np.arange(0, len(nodes))
cur_nodes = np.array(nodes)
prev_nodes = np.array([-1] * len(nodes), dtype="int32")
prev_succs = np.array([[]] * len(nodes), dtype="int32")
prev_nodes = np.array([-1] * len(nodes), dtype="int64")
prev_succs = np.array([[]] * len(nodes), dtype="int64")
for l in range(max_depth):
# select the walks not end
outdegree = self.outdegree(cur_nodes)
......@@ -693,7 +774,7 @@ class Graph(object):
break
cur_succs = self.successor(cur_nodes)
num_nodes = cur_nodes.shape[0]
nxt_nodes = np.zeros(num_nodes, dtype="int32")
nxt_nodes = np.zeros(num_nodes, dtype="int64")
for idx, (succ, prev_succ, walk_id, prev_node) in enumerate(
zip(cur_succs, prev_succs, cur_walk_ids, prev_nodes)):
......@@ -707,6 +788,16 @@ class Graph(object):
cur_nodes = nxt_nodes
return walk
@property
def num_graph(self):
""" Return Number of Graphs"""
return self._num_graph
@property
def graph_lod(self):
""" Return Graph Lod Index for Paddle Computation"""
return self._graph_lod
class SubGraph(Graph):
"""Implementation of SubGraph in pgl.
......@@ -760,3 +851,120 @@ class SubGraph(Graph):
A list of node ids in parent graph.
"""
return graph_kernel.map_nodes(nodes, self._to_reindex)
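# A minimal, hedged sketch (node/edge values and the helper name are made up)
# of the subgraph API described above: the selected nodes and edges are copied
# and re-indexed from 0 inside the returned SubGraph.
def _example_subgraph():
    g = Graph(num_nodes=5, edges=[(0, 1), (1, 2), (3, 4)])
    # keep nodes 0..2 and only the edges whose endpoints all lie inside them
    sub = g.subgraph(nodes=[0, 1, 2], edges=[(0, 1), (1, 2)])
    return sub.num_nodes, sub.edges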
class MultiGraph(Graph):
"""Implementation of multiple disjoint graph structure in pgl.
    This is a simple implementation that packs a list of disjoint graphs into one batched graph.
Args:
graph_list : A list of Graph Instances
Examples:
.. code-block:: python
batch_graph = MultiGraph([graph1, graph2, graph3])
"""
def __init__(self, graph_list):
num_nodes = np.sum([g.num_nodes for g in graph_list])
node_feat = self._join_node_feature(graph_list)
edge_feat = self._join_edge_feature(graph_list)
edges = self._join_edges(graph_list)
super(MultiGraph, self).__init__(
num_nodes=num_nodes,
edges=edges,
node_feat=node_feat,
edge_feat=edge_feat)
self._num_graph = len(graph_list)
self._src_graph = graph_list
graph_lod = [g.num_nodes for g in graph_list]
graph_lod = np.cumsum(graph_lod, dtype="int32")
graph_lod = np.insert(graph_lod, 0, 0)
self._graph_lod = graph_lod
def __getitem__(self, index):
return self._src_graph[index]
def _join_node_feature(self, graph_list):
"""join node features for multiple graph"""
node_feat = defaultdict(lambda: [])
for graph in graph_list:
for key in graph.node_feat:
node_feat[key].append(graph.node_feat[key])
ret_node_feat = {}
for key in node_feat:
ret_node_feat[key] = np.vstack(node_feat[key])
return ret_node_feat
def _join_edge_feature(self, graph_list):
"""join edge features for multiple graph"""
edge_feat = defaultdict(lambda: [])
for graph in graph_list:
for key in graph.edge_feat:
efeat = graph.edge_feat[key]
if len(efeat) > 0:
edge_feat[key].append(efeat)
ret_edge_feat = {}
for key in edge_feat:
ret_edge_feat[key] = np.vstack(edge_feat[key])
return ret_edge_feat
def _join_edges(self, graph_list):
"""join edges for multiple graph"""
list_edges = []
start_offset = 0
for graph in graph_list:
edges = graph.edges
if len(edges) > 0:
edges = edges + start_offset
list_edges.append(edges)
start_offset += graph.num_nodes
edges = np.vstack(list_edges)
return edges
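# A minimal, hedged usage sketch (graph sizes are illustrative): joining two
# small graphs shows how graph_lod stores cumulative node offsets and how the
# second graph's edges are shifted by the node count of the first.
def _example_multigraph_lod():
    g1 = Graph(num_nodes=3, edges=[(0, 1), (1, 2)])
    g2 = Graph(num_nodes=2, edges=[(0, 1)])
    batch = MultiGraph([g1, g2])
    # batch.num_nodes == 5, batch.graph_lod == [0, 3, 5], g2's edge becomes (3, 4)
    return batch.num_nodes, batch.graph_lod, batch.edges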
class MemmapEdgeIndex(EdgeIndex):
def __init__(self, path):
self._degree = np.load(os.path.join(path, 'degree.npy'), mmap_mode="r")
self._sorted_u = np.load(
os.path.join(path, 'sorted_u.npy'), mmap_mode="r")
self._sorted_v = np.load(
os.path.join(path, 'sorted_v.npy'), mmap_mode="r")
self._sorted_eid = np.load(
os.path.join(path, 'sorted_eid.npy'), mmap_mode="r")
self._indptr = np.load(os.path.join(path, 'indptr.npy'), mmap_mode="r")
class MemmapGraph(Graph):
def __init__(self, path):
self._num_nodes = np.load(os.path.join(path, 'num_nodes.npy'))
self._edges = np.load(os.path.join(path, 'edges.npy'), mmap_mode="r")
if os.path.isdir(os.path.join(path, 'adj_src')):
self._adj_src_index = MemmapEdgeIndex(
os.path.join(path, 'adj_src'))
else:
self._adj_src_index = None
if os.path.isdir(os.path.join(path, 'adj_dst')):
self._adj_dst_index = MemmapEdgeIndex(
os.path.join(path, 'adj_dst'))
else:
self._adj_dst_index = None
def load_feat(feat_path):
"""Load features from .npy file.
"""
feat = {}
if os.path.isdir(feat_path):
for feat_name in os.listdir(feat_path):
feat[os.path.splitext(feat_name)[0]] = np.load(
os.path.join(feat_path, feat_name), mmap_mode="r")
return feat
self._node_feat = load_feat(os.path.join(path, 'node_feat'))
self._edge_feat = load_feat(os.path.join(path, 'edge_feat'))
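# A minimal, hedged sketch (the dump path is illustrative) of the dump /
# memory-map round trip enabled by Graph.dump and MemmapGraph above.
def _example_dump_and_mmap(tmp_path="/tmp/pgl_graph_dump"):
    g = Graph(num_nodes=4, edges=[(0, 1), (1, 2), (2, 3)])
    _ = g.adj_src_index  # build both indexes so dump() also writes adj_src/adj_dst
    _ = g.adj_dst_index
    g.dump(tmp_path)
    mm = MemmapGraph(tmp_path)  # arrays are now opened with mmap_mode="r"
    return mm.num_nodes, mm.outdegree()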
......@@ -26,20 +26,20 @@ from libc.stdlib cimport rand, RAND_MAX
@cython.boundscheck(False)
@cython.wraparound(False)
def build_index(np.ndarray[np.int32_t, ndim=1] u,
np.ndarray[np.int32_t, ndim=1] v,
int num_nodes):
def build_index(np.ndarray[np.int64_t, ndim=1] u,
np.ndarray[np.int64_t, ndim=1] v,
long long num_nodes):
"""Building Edge Index
"""
cdef int i
cdef int h=len(u)
cdef int n_size = num_nodes
cdef np.ndarray[np.int32_t, ndim=1] degree = np.zeros([n_size], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] count = np.zeros([n_size], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_v = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_u = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] _tmp_eid = np.zeros([h], dtype=np.int32)
cdef np.ndarray[np.int32_t, ndim=1] indptr = np.zeros([n_size + 1], dtype=np.int32)
cdef long long i
cdef long long h=len(u)
cdef long long n_size = num_nodes
cdef np.ndarray[np.int64_t, ndim=1] degree = np.zeros([n_size], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] count = np.zeros([n_size], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] _tmp_v = np.zeros([h], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] _tmp_u = np.zeros([h], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] _tmp_eid = np.zeros([h], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] indptr = np.zeros([n_size + 1], dtype=np.int64)
with nogil:
for i in xrange(h):
......@@ -53,27 +53,34 @@ def build_index(np.ndarray[np.int32_t, ndim=1] u,
_tmp_eid[indptr[u[i]] + count[u[i]]] = i
_tmp_u[indptr[u[i]] + count[u[i]]] = u[i]
count[u[i]] += 1
return degree, _tmp_v, _tmp_u, _tmp_eid, indptr
cdef list output_eid = []
cdef list output_v = []
for i in xrange(n_size):
output_eid.append(_tmp_eid[indptr[i]:indptr[i+1]])
output_v.append(_tmp_v[indptr[i]:indptr[i+1]])
return np.array(output_v), np.array(output_eid), degree, _tmp_u, _tmp_v, _tmp_eid
@cython.boundscheck(False)
@cython.wraparound(False)
def slice_by_index(np.ndarray[np.int64_t, ndim=1] u,
np.ndarray[np.int64_t, ndim=1] indptr,
np.ndarray[np.int64_t, ndim=1] index):
cdef list output = []
cdef long long i
cdef long long h = len(index)
cdef long long j
for i in xrange(h):
j = index[i]
output.append(u[indptr[j]:indptr[j+1]])
return np.array(output)
@cython.boundscheck(False)
@cython.wraparound(False)
def map_edges(np.ndarray[np.int32_t, ndim=1] eid,
np.ndarray[np.int32_t, ndim=2] edges,
def map_edges(np.ndarray[np.int64_t, ndim=1] eid,
np.ndarray[np.int64_t, ndim=2] edges,
reindex):
"""Mapping edges by given dictionary
"""
cdef unordered_map[int, int] m = reindex
cdef int i = 0
cdef int h = len(eid)
cdef np.ndarray[np.int32_t, ndim=2] r_edges = np.zeros([h, 2], dtype=np.int32)
cdef int j
cdef unordered_map[long long, long long] m = reindex
cdef long long i = 0
cdef long long h = len(eid)
cdef np.ndarray[np.int64_t, ndim=2] r_edges = np.zeros([h, 2], dtype=np.int64)
cdef long long j
with nogil:
for i in xrange(h):
j = eid[i]
......@@ -86,31 +93,33 @@ def map_edges(np.ndarray[np.int32_t, ndim=1] eid,
def map_nodes(nodes, reindex):
"""Mapping nodes by given dictionary
"""
cdef unordered_map[int, int] m = reindex
cdef int i = 0
cdef int h = len(nodes)
cdef np.ndarray[np.int32_t, ndim=1] new_nodes = np.zeros([h], dtype=np.int32)
cdef int j
for i in xrange(h):
j = nodes[i]
new_nodes[i] = m[j]
cdef np.ndarray[np.int64_t, ndim=1] t_nodes = np.array(nodes, dtype=np.int64)
cdef unordered_map[long long, long long] m = reindex
cdef long long i = 0
cdef long long h = len(nodes)
cdef np.ndarray[np.int64_t, ndim=1] new_nodes = np.zeros([h], dtype=np.int64)
cdef long long j
with nogil:
for i in xrange(h):
j = t_nodes[i]
new_nodes[i] = m[j]
return new_nodes
@cython.boundscheck(False)
@cython.wraparound(False)
def node2vec_sample(np.ndarray[np.int32_t, ndim=1] succ,
np.ndarray[np.int32_t, ndim=1] prev_succ, int prev_node,
def node2vec_sample(np.ndarray[np.int64_t, ndim=1] succ,
np.ndarray[np.int64_t, ndim=1] prev_succ, long long prev_node,
float p, float q):
"""Fast implement of node2vec sampling
"""
cdef int i
cdef long long i
cdef succ_len = len(succ)
cdef prev_succ_len = len(prev_succ)
cdef vector[float] probs
cdef float prob_sum = 0
cdef unordered_set[int] prev_succ_set
cdef unordered_set[long long] prev_succ_set
for i in xrange(prev_succ_len):
prev_succ_set.insert(prev_succ[i])
......@@ -127,9 +136,188 @@ def node2vec_sample(np.ndarray[np.int32_t, ndim=1] succ,
cdef float rand_num = float(rand())/RAND_MAX * prob_sum
cdef int sample_succ = 0
cdef long long sample_succ = 0
for i in xrange(succ_len):
rand_num -= probs[i]
if rand_num <= 0:
sample_succ = succ[i]
return sample_succ
@cython.boundscheck(False)
@cython.wraparound(False)
def subset_choose_index(long long s_size,
np.ndarray[ndim=1, dtype=np.int64_t] nid,
np.ndarray[ndim=1, dtype=np.int64_t] rnd,
np.ndarray[ndim=1, dtype=np.int64_t] buff_nid,
long long offset):
cdef long long n_size = len(nid)
cdef long long i
cdef long long j
cdef unordered_map[long long, long long] m
with nogil:
for i in xrange(s_size):
j = rnd[offset + i] % n_size
if j >= i:
buff_nid[offset + i] = nid[j] if m.find(j) == m.end() else nid[m[j]]
m[j] = i if m.find(i) == m.end() else m[i]
else:
buff_nid[offset + i] = buff_nid[offset + j]
buff_nid[offset + j] = nid[i] if m.find(i) == m.end() else nid[m[i]]
@cython.boundscheck(False)
@cython.wraparound(False)
def subset_choose_index_eid(long long s_size,
np.ndarray[ndim=1, dtype=np.int64_t] nid,
np.ndarray[ndim=1, dtype=np.int64_t] eid,
np.ndarray[ndim=1, dtype=np.int64_t] rnd,
np.ndarray[ndim=1, dtype=np.int64_t] buff_nid,
np.ndarray[ndim=1, dtype=np.int64_t] buff_eid,
long long offset):
cdef long long n_size = len(nid)
cdef long long i
cdef long long j
cdef unordered_map[long long, long long] m
with nogil:
for i in xrange(s_size):
j = rnd[offset + i] % n_size
if j >= i:
if m.find(j) == m.end():
buff_nid[offset + i], buff_eid[offset + i] = nid[j], eid[j]
else:
buff_nid[offset + i], buff_eid[offset + i] = nid[m[j]], eid[m[j]]
m[j] = i if m.find(i) == m.end() else m[i]
else:
buff_nid[offset + i], buff_eid[offset + i] = buff_nid[offset + j], buff_eid[offset + j]
if m.find(i) == m.end():
buff_nid[offset + j], buff_eid[offset + j] = nid[i], eid[i]
else:
buff_nid[offset + j], buff_eid[offset + j] = nid[m[i]], eid[m[i]]
@cython.boundscheck(False)
@cython.wraparound(False)
def sample_subset(list nids, long long maxdegree, shuffle=False):
cdef np.ndarray[ndim=1, dtype=np.int64_t] buff_index
cdef long long buff_size, sample_size
cdef long long total_buff_size = 0
cdef long long inc = 0
cdef list output = []
for inc in xrange(len(nids)):
buff_size = len(nids[inc])
if buff_size > maxdegree:
total_buff_size += maxdegree
elif shuffle:
total_buff_size += buff_size
cdef np.ndarray[ndim=1, dtype=np.int64_t] buff_nid = np.zeros([total_buff_size], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] rnd = np.random.randint(0, np.iinfo(np.int64).max,
dtype=np.int64, size=total_buff_size)
cdef long long offset = 0
for inc in xrange(len(nids)):
buff_size = len(nids[inc])
if not shuffle and buff_size <= maxdegree:
output.append(nids[inc])
else:
sample_size = buff_size if buff_size <= maxdegree else maxdegree
if isinstance(nids[inc], list):
tmp = np.array(nids[inc], dtype=np.int64)
else:
tmp = nids[inc]
subset_choose_index(sample_size, tmp, rnd, buff_nid, offset)
output.append(buff_nid[offset:offset+sample_size])
offset += sample_size
return output
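# A minimal, hedged usage sketch (values are illustrative): every neighbour
# list longer than max_degree is down-sampled, while shorter lists pass
# through unchanged when shuffle is False.
def _example_sample_subset():
    neigh = [np.array([1, 2, 3, 4], dtype=np.int64),
             np.array([7], dtype=np.int64)]
    return sample_subset(neigh, 2)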
@cython.boundscheck(False)
@cython.wraparound(False)
def sample_subset_with_eid(list nids, list eids, long long maxdegree, shuffle=False):
cdef np.ndarray[ndim=1, dtype=np.int64_t] buff_index
cdef long long buff_size, sample_size
cdef long long total_buff_size = 0
cdef long long inc = 0
cdef list output = []
cdef list output_eid = []
for inc in xrange(len(nids)):
buff_size = len(nids[inc])
if buff_size > maxdegree:
total_buff_size += maxdegree
elif shuffle:
total_buff_size += buff_size
cdef np.ndarray[ndim=1, dtype=np.int64_t] buff_nid = np.zeros([total_buff_size], dtype=np.int64)
cdef np.ndarray[ndim=1, dtype=np.int64_t] buff_eid = np.zeros([total_buff_size], dtype=np.int64)
cdef np.ndarray[np.int64_t, ndim=1] rnd = np.random.randint(0, np.iinfo(np.int64).max,
dtype=np.int64, size=total_buff_size)
cdef long long offset = 0
for inc in xrange(len(nids)):
buff_size = len(nids[inc])
if not shuffle and buff_size <= maxdegree:
output.append(nids[inc])
output_eid.append(eids[inc])
else:
sample_size = buff_size if buff_size <= maxdegree else maxdegree
if isinstance(nids[inc], list):
tmp = np.array(nids[inc], dtype=np.int64)
tmp_eids = np.array(eids[inc], dtype=np.int64)
else:
tmp = nids[inc]
tmp_eids = eids[inc]
subset_choose_index_eid(sample_size, tmp, tmp_eids, rnd, buff_nid, buff_eid, offset)
output.append(buff_nid[offset:offset+sample_size])
output_eid.append(buff_eid[offset:offset+sample_size])
offset += sample_size
return output, output_eid
@cython.boundscheck(False)
@cython.wraparound(False)
def skip_gram_gen_pair(vector[long long] walk, long win_size=5):
cdef vector[long long] src
cdef vector[long long] dst
cdef long long l = len(walk)
cdef long long real_win_size, left, right, i
cdef np.ndarray[np.int64_t, ndim=1] rnd = np.random.randint(1, win_size+1,
dtype=np.int64, size=l)
with nogil:
for i in xrange(l):
real_win_size = rnd[i]
left = i - real_win_size
if left < 0:
left = 0
right = i + real_win_size
if right >= l:
right = l - 1
for j in xrange(left, right+1):
if walk[i] == walk[j]:
continue
src.push_back(walk[i])
dst.push_back(walk[j])
return src, dst
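# A minimal, hedged usage sketch (walk values are illustrative): every returned
# (src[i], dst[i]) pair lies inside a randomly shrunken window of radius at
# most win_size, and pairs with identical node values are skipped.
def _example_skip_gram_pairs():
    walk = [3, 1, 4, 1, 5, 9, 2, 6]
    return skip_gram_gen_pair(walk, win_size=2)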
@cython.boundscheck(False)
@cython.wraparound(False)
def alias_sample_build_table(np.ndarray[np.float64_t, ndim=1] probs):
cdef long long l = len(probs)
cdef np.ndarray[np.float64_t, ndim=1] alias = probs * l
cdef np.ndarray[np.int64_t, ndim=1] events = np.zeros(l, dtype=np.int64)
cdef vector[long long] larger_num, smaller_num
cdef long long i, s_i, l_i
with nogil:
for i in xrange(l):
if alias[i] > 1:
larger_num.push_back(i)
elif alias[i] < 1:
smaller_num.push_back(i)
while smaller_num.size() > 0 and larger_num.size() > 0:
s_i = smaller_num.back()
l_i = larger_num.back()
smaller_num.pop_back()
events[s_i] = l_i
alias[l_i] -= (1 - alias[s_i])
if alias[l_i] <= 1:
larger_num.pop_back()
if alias[l_i] < 1:
smaller_num.push_back(l_i)
return alias, events
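# A hedged companion sketch in plain numpy (the helper name is an assumption):
# the standard alias-method draw that consumes the (alias, events) tables
# built above.
def alias_sample_sketch(size, alias, events):
    """Draw `size` indices distributed according to the original `probs`."""
    idx = np.random.randint(0, len(alias), size=size)          # pick a column uniformly
    accept = np.random.uniform(0.0, 1.0, size=size) < alias[idx]
    return np.where(accept, idx, events[idx])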
......@@ -36,19 +36,22 @@ def send(src, dst, nfeat, efeat, message_func):
return msg
def recv(dst, uniq_dst, bucketing_index, msg, reduce_function, node_ids):
def recv(dst, uniq_dst, bucketing_index, msg, reduce_function, num_nodes,
num_edges):
"""Recv message from given msg to dst nodes.
"""
empty_msg_flag = fluid.layers.cast(num_edges > 0, dtype="float32")
if reduce_function == "sum":
if isinstance(msg, dict):
raise TypeError("The message for build-in function"
" should be Tensor not dict.")
try:
out_dims = msg.shape[-1]
init_output = fluid.layers.fill_constant_batch_size_like(
node_ids, shape=[1, out_dims], value=0, dtype="float32")
out_dim = msg.shape[-1]
init_output = fluid.layers.fill_constant(
shape=[num_nodes, out_dim], value=0, dtype="float32")
init_output.stop_gradient = False
msg = msg * empty_msg_flag
output = paddle_helper.scatter_add(init_output, dst, msg)
return output
except TypeError as e:
......@@ -60,17 +63,16 @@ def recv(dst, uniq_dst, bucketing_index, msg, reduce_function, node_ids):
reduce_function = sum_func
# convert msg into lodtensor
bucketed_msg = op.nested_lod_reset(msg, bucketing_index)
# Check dim for bucketed_msg equal to out_dims
output = reduce_function(bucketed_msg)
out_dims = output.shape[-1]
output_dim = output.shape[-1]
output = output * empty_msg_flag
init_output = fluid.layers.fill_constant_batch_size_like(
node_ids, shape=[1, out_dims], value=0, dtype="float32")
init_output.stop_gradient = False
output = fluid.layers.scatter(init_output, uniq_dst, output)
return output
init_output = fluid.layers.fill_constant(
shape=[num_nodes, output_dim], value=0, dtype="float32")
init_output.stop_gradient = True
final_output = fluid.layers.scatter(init_output, uniq_dst, output)
return final_output
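# A hedged numpy illustration (not fluid code) of the "sum" reduce path above:
# edge messages are scatter-added into one slot per destination node.
def _numpy_scatter_sum_sketch(dst, msg, num_nodes):
    """dst: int array [num_edges]; msg: float array [num_edges, hidden_dim]."""
    import numpy as np
    out = np.zeros((num_nodes, msg.shape[-1]), dtype=msg.dtype)
    np.add.at(out, dst, msg)  # mirrors paddle_helper.scatter_add(init_output, dst, msg)
    return out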
class BaseGraphWrapper(object):
......@@ -89,16 +91,21 @@ class BaseGraphWrapper(object):
"""
def __init__(self):
self._node_feat_tensor_dict = {}
self._edge_feat_tensor_dict = {}
self.node_feat_tensor_dict = {}
self.edge_feat_tensor_dict = {}
self._edges_src = None
self._edges_dst = None
self._num_nodes = None
self._indegree = None
self._edge_uniq_dst = None
self._edge_uniq_dst_count = None
self._bucketing_index = None
self._node_ids = None
self._graph_lod = None
self._num_graph = None
self._data_name_prefix = ""
def __repr__(self):
return self._data_name_prefix
def send(self, message_func, nfeat_list=None, efeat_list=None):
"""Send message from all src nodes to dst nodes.
......@@ -188,10 +195,11 @@ class BaseGraphWrapper(object):
output = recv(
dst=self._edges_dst,
uniq_dst=self._edge_uniq_dst,
bucketing_index=self._bucketing_index,
bucketing_index=self._edge_uniq_dst_count,
msg=msg,
reduce_function=reduce_function,
node_ids=self._node_ids)
num_edges=self._num_edges,
num_nodes=self._num_nodes)
return output
@property
......@@ -200,7 +208,7 @@ class BaseGraphWrapper(object):
Return:
A tuple of Tensor (src, dst). Src and dst are both
tensor with shape (num_edges, ) and dtype int32.
tensor with shape (num_edges, ) and dtype int64.
"""
return self._edges_src, self._edges_dst
......@@ -209,10 +217,28 @@ class BaseGraphWrapper(object):
"""Return a variable of number of nodes
Return:
A variable with shape (1,) as the number of nodes in int32.
A variable with shape (1,) as the number of nodes in int64.
"""
return self._num_nodes
@property
def graph_lod(self):
"""Return graph index for graphs
Return:
A variable with shape [None ] as the Lod information of multiple-graph.
"""
return self._graph_lod
@property
def num_graph(self):
"""Return a variable of number of graphs
Return:
A variable with shape (1,) as the number of Graphs in int64.
"""
return self._num_graph
@property
def edge_feat(self):
"""Return a dictionary of tensor representing edge features.
......@@ -221,7 +247,7 @@ class BaseGraphWrapper(object):
A dictionary whose keys are the feature names and the values
are feature tensor.
"""
return self._edge_feat_tensor_dict
return self.edge_feat_tensor_dict
@property
def node_feat(self):
......@@ -231,13 +257,13 @@ class BaseGraphWrapper(object):
A dictionary whose keys are the feature names and the values
are feature tensor.
"""
return self._node_feat_tensor_dict
return self.node_feat_tensor_dict
def indegree(self):
"""Return the indegree tensor for all nodes.
Return:
A tensor of shape (num_nodes, ) in int32.
A tensor of shape (num_nodes, ) in int64.
"""
return self._indegree
......@@ -252,7 +278,7 @@ class StaticGraphWrapper(BaseGraphWrapper):
graph: The static graph that should be put into memory
place: fluid.CPUPlace or fluid.GPUPlace(n) indicating the
place: fluid.CPUPlace or fluid.CUDAPlace(n) indicating the
device to hold the graph data.
Examples:
......@@ -299,19 +325,31 @@ class StaticGraphWrapper(BaseGraphWrapper):
def __init__(self, name, graph, place):
super(StaticGraphWrapper, self).__init__()
self._data_name_prefix = name
self._initializers = []
self.__data_name_prefix = name
self.__create_graph_attr(graph)
def __create_graph_attr(self, graph):
"""Create graph attributes for paddlepaddle.
"""
src, dst = list(zip(*graph.edges))
src, dst, eid = graph.sorted_edges(sort_by="dst")
indegree = graph.indegree()
nodes = graph.nodes
uniq_dst = nodes[indegree > 0]
uniq_dst_count = indegree[indegree > 0]
uniq_dst_count = np.cumsum(uniq_dst_count, dtype='int32')
uniq_dst_count = np.insert(uniq_dst_count, 0, 0)
graph_lod = graph.graph_lod
num_graph = graph.num_graph
num_edges = len(src)
if num_edges == 0:
# Fake Graph
src = np.array([0], dtype="int64")
dst = np.array([0], dtype="int64")
eid = np.array([0], dtype="int64")
uniq_dst_count = np.array([0, 1], dtype="int32")
uniq_dst = np.array([0], dtype="int64")
edge_feat = {}
......@@ -322,57 +360,67 @@ class StaticGraphWrapper(BaseGraphWrapper):
self.__create_graph_node_feat(node_feat, self._initializers)
self.__create_graph_edge_feat(edge_feat, self._initializers)
self._num_edges, init = paddle_helper.constant(
dtype="int64",
value=np.array(
[num_edges], dtype="int64"),
name=self._data_name_prefix + '/num_edges')
self._initializers.append(init)
self._num_graph, init = paddle_helper.constant(
dtype="int64",
value=np.array(
[num_graph], dtype="int64"),
name=self._data_name_prefix + '/num_graph')
self._initializers.append(init)
self._edges_src, init = paddle_helper.constant(
dtype="int32",
dtype="int64",
value=src,
name=self.__data_name_prefix + '_edges_src')
name=self._data_name_prefix + '/edges_src')
self._initializers.append(init)
self._edges_dst, init = paddle_helper.constant(
dtype="int32",
dtype="int64",
value=dst,
name=self.__data_name_prefix + '_edges_dst')
name=self._data_name_prefix + '/edges_dst')
self._initializers.append(init)
self._num_nodes, init = paddle_helper.constant(
dtype="int32",
dtype="int64",
hide_batch_size=False,
value=np.array([graph.num_nodes]),
name=self.__data_name_prefix + '_num_nodes')
name=self._data_name_prefix + '/num_nodes')
self._initializers.append(init)
self._edge_uniq_dst, init = paddle_helper.constant(
name=self.__data_name_prefix + "_uniq_dst",
dtype="int32",
name=self._data_name_prefix + "/uniq_dst",
dtype="int64",
value=uniq_dst)
self._initializers.append(init)
self._edge_uniq_dst_count, init = paddle_helper.constant(
name=self.__data_name_prefix + "_uniq_dst_count",
name=self._data_name_prefix + "/uniq_dst_count",
dtype="int32",
value=uniq_dst_count)
self._initializers.append(init)
bucket_value = np.expand_dims(
np.arange(
0, len(dst), dtype="int32"), -1)
self._bucketing_index, init = paddle_helper.lod_constant(
name=self.__data_name_prefix + "_bucketing_index",
self._graph_lod, init = paddle_helper.constant(
name=self._data_name_prefix + "/graph_lod",
dtype="int32",
lod=list(uniq_dst_count),
value=bucket_value)
value=graph_lod)
self._initializers.append(init)
node_ids_value = np.arange(0, graph.num_nodes, dtype="int32")
node_ids_value = np.arange(0, graph.num_nodes, dtype="int64")
self._node_ids, init = paddle_helper.constant(
name=self.__data_name_prefix + "_node_ids",
dtype="int32",
name=self._data_name_prefix + "/node_ids",
dtype="int64",
value=node_ids_value)
self._initializers.append(init)
self._indegree, init = paddle_helper.constant(
name=self.__data_name_prefix + "_indegree",
dtype="int32",
name=self._data_name_prefix + "/indegree",
dtype="int64",
value=indegree)
self._initializers.append(init)
......@@ -382,9 +430,10 @@ class StaticGraphWrapper(BaseGraphWrapper):
for node_feat_name, node_feat_value in node_feat.items():
node_feat_shape = node_feat_value.shape
node_feat_dtype = node_feat_value.dtype
self._node_feat_tensor_dict[
self.node_feat_tensor_dict[
node_feat_name], init = paddle_helper.constant(
name=self.__data_name_prefix + '_' + node_feat_name,
name=self._data_name_prefix + '/node_feat/' +
node_feat_name,
dtype=node_feat_dtype,
value=node_feat_value)
collector.append(init)
......@@ -395,9 +444,10 @@ class StaticGraphWrapper(BaseGraphWrapper):
for edge_feat_name, edge_feat_value in edge_feat.items():
edge_feat_shape = edge_feat_value.shape
edge_feat_dtype = edge_feat_value.dtype
self._edge_feat_tensor_dict[
self.edge_feat_tensor_dict[
edge_feat_name], init = paddle_helper.constant(
name=self.__data_name_prefix + '_' + edge_feat_name,
name=self._data_name_prefix + '/edge_feat/' +
edge_feat_name,
dtype=edge_feat_dtype,
value=edge_feat_value)
collector.append(init)
......@@ -406,7 +456,7 @@ class StaticGraphWrapper(BaseGraphWrapper):
"""Placing the graph data into the devices.
Args:
place: fluid.CPUPlace or fluid.GPUPlace(n) indicating the
place: fluid.CPUPlace or fluid.CUDAPlace(n) indicating the
device to hold the graph data.
"""
log.info(
......@@ -425,7 +475,7 @@ class GraphWrapper(BaseGraphWrapper):
Args:
name: The graph data prefix
place: fluid.CPUPlace or fluid.GPUPlace(n) indicating the
place: fluid.CPUPlace or fluid.CUDAPlace(n) indicating the
device to hold the graph data.
        node_feat: A list of tuples that describe the details of node
......@@ -483,7 +533,9 @@ class GraphWrapper(BaseGraphWrapper):
def __init__(self, name, place, node_feat=[], edge_feat=[]):
super(GraphWrapper, self).__init__()
self.__data_name_prefix = name
# collect holders for PyReader
self._data_name_prefix = name
self._holder_list = []
self._place = place
self.__create_graph_attr_holders()
for node_feat_name, node_feat_shape, node_feat_dtype in node_feat:
......@@ -497,79 +549,108 @@ class GraphWrapper(BaseGraphWrapper):
def __create_graph_attr_holders(self):
"""Create data holders for graph attributes.
"""
self._num_edges = fluid.layers.data(
self._data_name_prefix + '/num_edges',
shape=[1],
append_batch_size=False,
dtype="int64",
stop_gradient=True)
self._num_graph = fluid.layers.data(
self._data_name_prefix + '/num_graph',
shape=[1],
append_batch_size=False,
dtype="int64",
stop_gradient=True)
self._edges_src = fluid.layers.data(
self.__data_name_prefix + '_edges_src',
self._data_name_prefix + '/edges_src',
shape=[None],
append_batch_size=False,
dtype="int32",
dtype="int64",
stop_gradient=True)
self._edges_dst = fluid.layers.data(
self.__data_name_prefix + '_edges_dst',
self._data_name_prefix + '/edges_dst',
shape=[None],
append_batch_size=False,
dtype="int32",
dtype="int64",
stop_gradient=True)
self._num_nodes = fluid.layers.data(
self.__data_name_prefix + '_num_nodes',
self._data_name_prefix + '/num_nodes',
shape=[1],
append_batch_size=False,
dtype='int32',
dtype='int64',
stop_gradient=True)
self._edge_uniq_dst = fluid.layers.data(
self.__data_name_prefix + "_uniq_dst",
self._data_name_prefix + "/uniq_dst",
shape=[None],
append_batch_size=False,
dtype="int32",
dtype="int64",
stop_gradient=True)
self._edge_uniq_dst_count = fluid.layers.data(
self.__data_name_prefix + "_uniq_dst_count",
self._graph_lod = fluid.layers.data(
self._data_name_prefix + "/graph_lod",
shape=[None],
append_batch_size=False,
dtype="int32",
stop_gradient=True)
self._bucketing_index = fluid.layers.data(
self.__data_name_prefix + "_bucketing_index",
shape=[None, 1],
self._edge_uniq_dst_count = fluid.layers.data(
self._data_name_prefix + "/uniq_dst_count",
shape=[None],
append_batch_size=False,
dtype="int32",
lod_level=1,
stop_gradient=True)
self._node_ids = fluid.layers.data(
self.__data_name_prefix + "_node_ids",
self._data_name_prefix + "/node_ids",
shape=[None],
append_batch_size=False,
dtype="int32",
dtype="int64",
stop_gradient=True)
self._indegree = fluid.layers.data(
self.__data_name_prefix + "_indegree",
self._data_name_prefix + "/indegree",
shape=[None],
append_batch_size=False,
dtype="int32",
dtype="int64",
stop_gradient=True)
self._holder_list.extend([
self._edges_src,
self._edges_dst,
self._num_nodes,
self._edge_uniq_dst,
self._edge_uniq_dst_count,
self._node_ids,
self._indegree,
self._graph_lod,
self._num_graph,
self._num_edges,
])
def __create_graph_node_feat_holders(self, node_feat_name, node_feat_shape,
node_feat_dtype):
"""Create data holders for node features.
"""
feat_holder = fluid.layers.data(
self.__data_name_prefix + '_' + node_feat_name,
self._data_name_prefix + '/node_feat/' + node_feat_name,
shape=node_feat_shape,
append_batch_size=False,
dtype=node_feat_dtype,
stop_gradient=True)
self._node_feat_tensor_dict[node_feat_name] = feat_holder
self.node_feat_tensor_dict[node_feat_name] = feat_holder
self._holder_list.append(feat_holder)
def __create_graph_edge_feat_holders(self, edge_feat_name, edge_feat_shape,
edge_feat_dtype):
"""Create edge holders for edge features.
"""
feat_holder = fluid.layers.data(
self.__data_name_prefix + '_' + edge_feat_name,
self._data_name_prefix + '/edge_feat/' + edge_feat_name,
shape=edge_feat_shape,
append_batch_size=False,
dtype=edge_feat_dtype,
stop_gradient=True)
self._edge_feat_tensor_dict[edge_feat_name] = feat_holder
self.edge_feat_tensor_dict[edge_feat_name] = feat_holder
self._holder_list.append(feat_holder)
def to_feed(self, graph):
"""Convert the graph into feed_dict.
......@@ -588,8 +669,22 @@ class GraphWrapper(BaseGraphWrapper):
src, dst, eid = graph.sorted_edges(sort_by="dst")
indegree = graph.indegree()
nodes = graph.nodes
num_edges = len(src)
uniq_dst = nodes[indegree > 0]
uniq_dst_count = indegree[indegree > 0]
uniq_dst_count = np.cumsum(uniq_dst_count, dtype='int32')
uniq_dst_count = np.insert(uniq_dst_count, 0, 0)
num_graph = graph.num_graph
graph_lod = graph.graph_lod
if num_edges == 0:
# Fake Graph
src = np.array([0], dtype="int64")
dst = np.array([0], dtype="int64")
eid = np.array([0], dtype="int64")
uniq_dst_count = np.array([0, 1], dtype="int32")
uniq_dst = np.array([0], dtype="int64")
edge_feat = {}
......@@ -597,21 +692,33 @@ class GraphWrapper(BaseGraphWrapper):
edge_feat[key] = value[eid]
node_feat = graph.node_feat
feed_dict[self.__data_name_prefix + '_edges_src'] = src
feed_dict[self.__data_name_prefix + '_edges_dst'] = dst
feed_dict[self.__data_name_prefix + '_num_nodes'] = graph.num_nodes
feed_dict[self.__data_name_prefix + '_uniq_dst'] = uniq_dst
feed_dict[self.__data_name_prefix + '_uniq_dst_count'] = uniq_dst_count
feed_dict[self.__data_name_prefix + '_node_ids'] = graph.nodes
feed_dict[self.__data_name_prefix + '_indegree'] = indegree
feed_dict[self.__data_name_prefix + '_bucketing_index'] = \
fluid.create_lod_tensor(np.expand_dims(np.arange(0, len(dst), dtype="int32"), -1),
[list(uniq_dst_count)], self._place)
for key in self._node_feat_tensor_dict:
feed_dict[self.__data_name_prefix + '_' + key] = node_feat[key]
for key in self._edge_feat_tensor_dict:
feed_dict[self.__data_name_prefix + '_' + key] = edge_feat[key]
feed_dict[self._data_name_prefix + '/num_edges'] = np.array(
[num_edges], dtype="int64")
feed_dict[self._data_name_prefix + '/edges_src'] = src
feed_dict[self._data_name_prefix + '/edges_dst'] = dst
feed_dict[self._data_name_prefix + '/num_nodes'] = np.array(
[graph.num_nodes], dtype="int64")
feed_dict[self._data_name_prefix + '/uniq_dst'] = uniq_dst
feed_dict[self._data_name_prefix + '/uniq_dst_count'] = uniq_dst_count
feed_dict[self._data_name_prefix + '/node_ids'] = graph.nodes
feed_dict[self._data_name_prefix + '/indegree'] = indegree
feed_dict[self._data_name_prefix + '/graph_lod'] = graph_lod
feed_dict[self._data_name_prefix + '/num_graph'] = np.array(
[num_graph], dtype="int64")
feed_dict[self._data_name_prefix + '/indegree'] = indegree
for key in self.node_feat_tensor_dict:
feed_dict[self._data_name_prefix + '/node_feat/' +
key] = node_feat[key]
for key in self.edge_feat_tensor_dict:
feed_dict[self._data_name_prefix + '/edge_feat/' +
key] = edge_feat[key]
return feed_dict
@property
def holder_list(self):
"""Return the holder list.
"""
return self._holder_list
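# A small sketch of what ``GraphWrapper.to_feed`` returns: a plain dict that maps
# holder names of the form ``<name>/<field>`` (and ``<name>/node_feat/<key>``,
# ``<name>/edge_feat/<key>``) to numpy arrays. It assumes the ``pgl.graph.Graph``
# constructor used in the tests of this repository; the helper name and the toy
# data are illustrative only.
def _graph_wrapper_to_feed_sketch():
    import numpy as np
    import paddle.fluid as fluid
    from pgl import graph

    g = graph.Graph(
        num_nodes=3,
        edges=[(0, 1), (1, 2)],
        node_feat={"feature": np.ones((3, 4), dtype="float32")})

    place = fluid.CPUPlace()
    prog, startup = fluid.Program(), fluid.Program()
    with fluid.program_guard(prog, startup):
        gw = GraphWrapper(
            name="graph",
            place=place,
            node_feat=g.node_feat_info(),
            edge_feat=g.edge_feat_info())

    feed_dict = gw.to_feed(g)
    # e.g. feed_dict["graph/edges_src"], feed_dict["graph/num_nodes"] and
    # feed_dict["graph/node_feat/feature"] are all numpy arrays.
    return feed_dict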
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This package implement Heterogeneous Graph structure for handling Heterogeneous graph data.
"""
import time
import numpy as np
import pickle as pkl
import time
import pgl.graph_kernel as graph_kernel
from pgl.graph import Graph
__all__ = ['HeterGraph', 'SubHeterGraph']
def _hide_num_nodes(shape):
"""Set the first dimension as unknown
"""
shape = list(shape)
shape[0] = None
return shape
class HeterGraph(object):
"""Implementation of heterogeneous graph structure in pgl
This is a simple implementation of heterogeneous graph structure in pgl.
Args:
num_nodes: number of nodes in a heterogeneous graph
edges: dict, every element in dict is a list of (u, v) tuples.
node_types (optional): list of (u, node_type) tuples to specify the node type of every node
node_feat (optional): a dict of numpy array as node features
edge_feat (optional): a dict of dict as edge features for every edge type
Examples:
.. code-block:: python
import numpy as np
num_nodes = 4
node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
edges = {
'edges_type1': [(0,1), (3,2)],
'edges_type2': [(1,2), (3,1)],
}
node_feat = {'feature': np.random.randn(4, 16)}
edges_feat = {
'edges_type1': {'h': np.random.randn(2, 16)},
'edges_type2': {'h': np.random.randn(2, 16)},
}
g = heter_graph.HeterGraph(
num_nodes=num_nodes,
edges=edges,
node_types=node_types,
node_feat=node_feat,
edge_feat=edges_feat)
"""
def __init__(self,
num_nodes,
edges,
node_types=None,
node_feat=None,
edge_feat=None):
self._num_nodes = num_nodes
self._edges_dict = edges
if isinstance(node_types, list):
self._node_types = np.array(node_types, dtype=object)[:, 1]
else:
self._node_types = node_types
self._nodes_type_dict = {}
for n_type in np.unique(self._node_types):
self._nodes_type_dict[n_type] = np.where(
self._node_types == n_type)[0]
if node_feat is not None:
self._node_feat = node_feat
else:
self._node_feat = {}
if edge_feat is not None:
self._edge_feat = edge_feat
else:
self._edge_feat = {}
self._multi_graph = {}
for key, value in self._edges_dict.items():
if not self._edge_feat:
edge_feat = None
else:
edge_feat = self._edge_feat[key]
self._multi_graph[key] = Graph(
num_nodes=self._num_nodes,
edges=value,
node_feat=self._node_feat,
edge_feat=edge_feat)
self._edge_types = self.edge_types_info()
@property
def edge_types(self):
"""Return a list of edge types.
"""
return self._edge_types
@property
def num_nodes(self):
"""Return the number of nodes.
"""
return self._num_nodes
@property
def num_edges(self):
"""Return edges number of all edge types.
"""
n_edges = {}
for e_type in self._edge_types:
n_edges[e_type] = self._multi_graph[e_type].num_edges
return n_edges
@property
def node_types(self):
"""Return the node types.
"""
return self._node_types
@property
def edge_feat(self):
"""Return edge features of all edge types.
"""
return self._edge_feat
@property
def node_feat(self):
"""Return a dictionary of node features.
"""
return self._node_feat
@property
def nodes(self):
"""Return all nodes id from 0 to :code:`num_nodes - 1`
"""
return np.arange(self._num_nodes, dtype='int64')
def __getitem__(self, edge_type):
"""__getitem__
"""
return self._multi_graph[edge_type]
def num_nodes_by_type(self, n_type=None):
"""Return the number of nodes with the specified node type.
"""
if n_type not in self._nodes_type_dict:
raise ("%s is not in valid node type" % n_type)
else:
return len(self._nodes_type_dict[n_type])
def indegree(self, nodes=None, edge_type=None):
"""Return the indegree of the given nodes with the specified edge_type.
Args:
nodes: Return the indegree of the given nodes.
If nodes is None, return the indegree of all nodes.
edge_type: Return the indegree for the specified edge_type.
If edge_type is None, return the total indegree of the given nodes.
Return:
A numpy.ndarray as the given nodes' indegree.
"""
if edge_type is None:
indegrees = []
for e_type in self._edge_types:
indegrees.append(self._multi_graph[e_type].indegree(nodes))
indegrees = np.sum(np.vstack(indegrees), axis=0)
return indegrees
else:
return self._multi_graph[edge_type].indegree(nodes)
def outdegree(self, nodes=None, edge_type=None):
"""Return the outdegree of the given nodes with the specified edge_type.
Args:
nodes: Return the outdegree of the given nodes.
If nodes is None, return the outdegree of all nodes.
edge_type: Return the outdegree for the specified edge_type.
If edge_type is None, return the total outdegree of the given nodes.
Return:
A numpy.array as the given nodes' outdegree.
"""
if edge_type is None:
outdegrees = []
for e_type in self._edge_types:
outdegrees.append(self._multi_graph[e_type].outdegree(nodes))
outdegrees = np.sum(np.vstack(outdegrees), axis=0)
return outdegrees
else:
return self._multi_graph[edge_type].outdegree(nodes)
def successor(self, edge_type, nodes=None, return_eids=False):
"""Find successor of given nodes with the specified edge_type.
Args:
nodes: Return the successors of the given nodes.
If nodes is None, return the successors of all nodes.
edge_type: Return the successors for the specified edge_type.
If edge_type is None, return all successors of the given nodes,
and the eids are invalid in this case.
return_eids: If True, return the nodes together with the corresponding eids.
"""
return self._multi_graph[edge_type].successor(nodes, return_eids)
def sample_successor(self,
edge_type,
nodes,
max_degree,
return_eids=False,
shuffle=False):
"""Sample successors of given nodes with the specified edge_type.
Args:
edge_type: The specified edge_type.
nodes: Given nodes whose successors will be sampled.
max_degree: The maximum number of sampled successors for each node.
return_eids: Whether to return the corresponding eids.
Return:
Return a list of numpy.ndarray, where each numpy.ndarray is a list
of sampled successor ids for the given nodes with the specified edge type.
If :code:`return_eids=True`, there will be an additional list of
numpy.ndarray, where each numpy.ndarray holds the eids that connect
the nodes to their successors.
"""
return self._multi_graph[edge_type].sample_successor(
nodes=nodes,
max_degree=max_degree,
return_eids=return_eids,
shuffle=shuffle)
def predecessor(self, edge_type, nodes=None, return_eids=False):
"""Find predecessor of given nodes with the specified edge_type.
Args:
nodes: Return the predecessors of the given nodes.
If nodes is None, return the predecessors of all nodes.
edge_type: Return the predecessors for the specified edge_type.
return_eids: If True, return the nodes together with the corresponding eids.
"""
return self._multi_graph[edge_type].predecessor(nodes, return_eids)
def sample_predecessor(self,
edge_type,
nodes,
max_degree,
return_eids=False,
shuffle=False):
"""Sample predecessors of given nodes with the specified edge_type.
Args:
edge_type: The specified edge_type.
nodes: Given nodes whose predecessors will be sampled.
max_degree: The maximum number of sampled predecessors for each node.
return_eids: Whether to return the corresponding eids.
Return:
Return a list of numpy.ndarray, where each numpy.ndarray is a list
of sampled predecessor ids for the given nodes with the specified edge type.
If :code:`return_eids=True`, there will be an additional list of
numpy.ndarray, where each numpy.ndarray holds the eids that connect
the nodes to their predecessors.
"""
return self._multi_graph[edge_type].sample_predecessor(
nodes=nodes,
max_degree=max_degree,
return_eids=return_eids,
shuffle=shuffle)
def node_batch_iter(self, batch_size, shuffle=True, n_type=None):
"""Node batch iterator
Iterate all nodes by batch with the specified node type.
Args:
batch_size: The batch size of each batch of nodes.
shuffle: Whether to shuffle the nodes.
n_type: Iterate over the nodes with the specified node type. If n_type is None,
iterate over all nodes by batch.
Return:
Batch iterator
"""
if n_type is None:
nodes = np.arange(self._num_nodes, dtype="int64")
else:
nodes = self._nodes_type_dict[n_type]
if shuffle:
np.random.shuffle(nodes)
start = 0
while start < len(nodes):
yield nodes[start:start + batch_size]
start += batch_size
def sample_nodes(self, sample_num, n_type=None):
"""Sample nodes with the specified n_type from the graph
This function helps to sample nodes with the specified n_type from the graph.
If n_type is None, this function will sample nodes from all nodes.
Nodes might be duplicated.
Args:
sample_num: The number of samples
n_type: The type of nodes to be sampled
Return:
A list of nodes
"""
if n_type is not None:
return np.random.choice(
self._nodes_type_dict[n_type], size=sample_num)
else:
return np.random.randint(
low=0, high=self._num_nodes, size=sample_num)
def node_feat_info(self):
"""Return the information of node feature for HeterGraphWrapper.
This function returns the information of node features for all node types,
and it is used to help construct a HeterGraphWrapper.
Return:
A list of tuple (name, shape, dtype) for all given node feature.
"""
node_feat_info = []
for feat_name, feat in self._node_feat.items():
node_feat_info.append(
(feat_name, _hide_num_nodes(feat.shape), feat.dtype))
return node_feat_info
def edge_feat_info(self):
"""Return the information of edge feature for HeterGraphWrapper.
This function returns the information of edge features for all edge types,
and it is used to help construct a HeterGraphWrapper.
Return:
A dict of list of tuple (name, shape, dtype) for all given edge feature.
"""
edge_feat_info = {}
for edge_type_name, feat_dict in self._edge_feat.items():
tmp_edge_feat_info = []
for feat_name, feat in feat_dict.items():
full_name = feat_name
tmp_edge_feat_info.append(
(full_name, _hide_num_nodes(feat.shape), feat.dtype))
edge_feat_info[edge_type_name] = tmp_edge_feat_info
return edge_feat_info
def edge_types_info(self):
"""Return the information of all edge types.
Return:
A list of all edge types.
"""
edge_types_info = []
for key, _ in self._edges_dict.items():
edge_types_info.append(key)
return edge_types_info
class SubHeterGraph(HeterGraph):
"""Implementation of SubHeterGraph in pgl.
SubHeterGraph inherits from :code:`HeterGraph`.
Args:
num_nodes: number of nodes in a heterogeneous graph
edges: dict, every element in dict is a list of (u, v) tuples.
node_types (optional): list of (u, node_type) tuples to specify the node type of every node
node_feat (optional): a dict of numpy array as node features
edge_feat (optional): a dict of dict as edge features for every edge type
reindex: A dictionary that maps parent hetergraph node id to subhetergraph node id.
"""
def __init__(self,
num_nodes,
edges,
node_types=None,
node_feat=None,
edge_feat=None,
reindex=None):
super(SubHeterGraph, self).__init__(
num_nodes=num_nodes,
edges=edges,
node_types=node_types,
node_feat=node_feat,
edge_feat=edge_feat)
if reindex is None:
reindex = {}
self._from_reindex = reindex
self._to_reindex = {u: v for v, u in reindex.items()}
def reindex_from_parrent_nodes(self, nodes):
"""Map the given parent graph node id to subgraph id.
Args:
nodes: A list of nodes from parent graph.
Return:
A list of subgraph ids.
"""
return graph_kernel.map_nodes(nodes, self._from_reindex)
def reindex_to_parrent_nodes(self, nodes):
"""Map the given subgraph node id to parent graph id.
Args:
nodes: A list of nodes in this subgraph.
Return:
A list of node ids in parent graph.
"""
return graph_kernel.map_nodes(nodes, self._to_reindex)
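# A short sketch of querying a ``HeterGraph``: per-edge-type degrees, typed
# node iteration and typed sampling. It reuses the toy data from the class
# docstring above; the edge type names and the helper name are illustrative only.
def _heter_graph_query_sketch():
    num_nodes = 4
    node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
    edges = {
        'edges_type1': [(0, 1), (3, 2)],
        'edges_type2': [(1, 2), (3, 1)],
    }
    g = HeterGraph(num_nodes=num_nodes, edges=edges, node_types=node_types)

    # Total indegree sums over all edge types; a typed query looks at one type.
    total_indeg = g.indegree()
    typed_indeg = g.indegree(edge_type='edges_type1')

    # Iterate only over 'user' nodes, two per batch.
    user_batches = list(g.node_batch_iter(batch_size=2, shuffle=False, n_type='user'))

    # Sample three 'item' nodes (duplicates are allowed).
    sampled_items = g.sample_nodes(sample_num=3, n_type='item')
    return total_indeg, typed_indeg, user_batches, sampled_items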
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This package provides interface to help building static computational graph
for PaddlePaddle.
"""
import warnings
import numpy as np
import paddle.fluid as fluid
from pgl.utils import op
from pgl.utils import paddle_helper
from pgl.utils.logger import log
from pgl.graph_wrapper import GraphWrapper
ALL = "__ALL__"
__all__ = ["HeterGraphWrapper"]
def is_all(arg):
"""is_all
"""
return isinstance(arg, str) and arg == ALL
class HeterGraphWrapper(object):
"""Implement a heterogeneous graph wrapper that creates a graph data holders
that attributes and features in the heterogeneous graph.
And we provide interface :code:`to_feed` to help converting :code:`Graph`
data into :code:`feed_dict`.
Args:
name: The heterogeneous graph data prefix
place: fluid.CPUPlace or fluid.CUDAPlace(n) indicating the
device to hold the graph data.
node_feat: A list of tuples that describe the details of the node
feature tensors. Each tuple must be (name, shape, dtype)
and the first dimension of the shape must be set unknown
(-1 or None) or we can easily use :code:`HeterGraph.node_feat_info()`
to get the node_feat settings.
edge_feat: A dict of lists of tuples that describe the details of the edge
feature tensors. Each tuple must be (name, shape, dtype)
and the first dimension of the shape must be set unknown
(-1 or None) or we can easily use :code:`HeterGraph.edge_feat_info()`
to get the edge_feat settings.
Examples:
.. code-block:: python
import paddle.fluid as fluid
import numpy as np
from pgl import heter_graph
from pgl import heter_graph_wrapper
num_nodes = 4
node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
edges = {
'edges_type1': [(0,1), (3,2)],
'edges_type2': [(1,2), (3,1)],
}
node_feat = {'feature': np.random.randn(4, 16)}
edges_feat = {
'edges_type1': {'h': np.random.randn(2, 16)},
'edges_type2': {'h': np.random.randn(2, 16)},
}
g = heter_graph.HeterGraph(
num_nodes=num_nodes,
edges=edges,
node_types=node_types,
node_feat=node_feat,
edge_feat=edges_feat)
place = fluid.CPUPlace()
gw = heter_graph_wrapper.HeterGraphWrapper(
name='heter_graph',
place = place,
edge_types = g.edge_types_info(),
node_feat=g.node_feat_info(),
edge_feat=g.edge_feat_info())
"""
def __init__(self, name, place, edge_types, node_feat={}, edge_feat={}):
self.__data_name_prefix = name
self._place = place
self._edge_types = edge_types
self._multi_gw = {}
for edge_type in self._edge_types:
type_name = self.__data_name_prefix + '/' + edge_type
if node_feat:
n_feat = node_feat
else:
n_feat = {}
if edge_feat:
e_feat = edge_feat[edge_type]
else:
e_feat = {}
self._multi_gw[edge_type] = GraphWrapper(
name=type_name,
place=self._place,
node_feat=n_feat,
edge_feat=e_feat)
def to_feed(self, heterGraph, edge_types_list=ALL):
"""Convert the graph into feed_dict.
This function helps to convert graph data into a feed dict
for :code:`fluid.Executor` to run the model.
Args:
heterGraph: the :code:`HeterGraph` data object
edge_types_list: the edge types list to be fed
Return:
A dictionary that contains the data holder names and their corresponding data.
"""
multi_graphs = heterGraph._multi_graph
if is_all(edge_types_list):
edge_types_list = self._edge_types
feed_dict = {}
for edge_type in edge_types_list:
feed_d = self._multi_gw[edge_type].to_feed(multi_graphs[edge_type])
feed_dict.update(feed_d)
return feed_dict
def __getitem__(self, edge_type):
"""__getitem__
"""
return self._multi_gw[edge_type]
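# A brief sketch of using the per-edge-type ``GraphWrapper`` objects inside a
# ``HeterGraphWrapper``: one message-passing branch per edge type, plus a single
# ``to_feed`` call for the whole heterogeneous graph. It reuses the toy graph
# from the class docstring above; the helper name and the "sum" merge are
# illustrative choices only.
def _heter_graph_wrapper_usage_sketch():
    import numpy as np
    import paddle.fluid as fluid
    from pgl import heter_graph
    from pgl import heter_graph_wrapper

    node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
    edges = {'edges_type1': [(0, 1), (3, 2)], 'edges_type2': [(1, 2), (3, 1)]}
    node_feat = {'feature': np.random.randn(4, 16).astype("float32")}
    g = heter_graph.HeterGraph(
        num_nodes=4, edges=edges, node_types=node_types, node_feat=node_feat)

    place = fluid.CPUPlace()
    prog, startup = fluid.Program(), fluid.Program()
    with fluid.program_guard(prog, startup):
        gw = heter_graph_wrapper.HeterGraphWrapper(
            name='heter_graph',
            place=place,
            edge_types=g.edge_types_info(),
            node_feat=g.node_feat_info(),
            edge_feat=g.edge_feat_info())
        outputs = []
        for e_type in g.edge_types_info():
            sub_gw = gw[e_type]  # a plain GraphWrapper for one edge type
            msg = sub_gw.send(
                lambda src, dst, edge: src["h"],
                nfeat_list=[("h", sub_gw.node_feat["feature"])])
            outputs.append(sub_gw.recv(msg, "sum"))
        merged = fluid.layers.sums(outputs)

    exe = fluid.Executor(place)
    exe.run(startup)
    # One feed dict covers every edge type (or a subset via edge_types_list).
    return exe.run(prog, feed=gw.to_feed(g), fetch_list=[merged])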
......@@ -16,6 +16,12 @@
from pgl.layers import conv
from pgl.layers.conv import *
from pgl.layers import set2set
from pgl.layers.set2set import *
from pgl.layers import graph_pool
from pgl.layers.graph_pool import *
__all__ = []
__all__ += conv.__all__
__all__ += set2set.__all__
__all__ += graph_pool.__all__
......@@ -18,7 +18,7 @@ import paddle.fluid as fluid
from pgl import graph_wrapper
from pgl.utils import paddle_helper
__all__ = ['gcn', 'gat', 'gin']
def gcn(gw, feature, hidden_size, activation, name, norm=None):
......@@ -53,7 +53,7 @@ def gcn(gw, feature, hidden_size, activation, name, norm=None):
feature = fluid.layers.fc(feature,
size=hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name))
if norm is not None:
feature = feature * norm
......@@ -67,7 +67,7 @@ def gcn(gw, feature, hidden_size, activation, name, norm=None):
output = fluid.layers.fc(output,
size=hidden_size,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name))
if norm is not None:
output = output * norm
......@@ -152,7 +152,7 @@ def gat(gw,
ft = fluid.layers.fc(feature,
hidden_size * num_heads,
bias_attr=False,
param_attr=fluid.ParamAttr(name=name + '_weight'))
left_a = fluid.layers.create_parameter(
shape=[num_heads, hidden_size],
dtype='float32',
......@@ -178,3 +178,73 @@ def gat(gw,
bias.stop_gradient = True
output = fluid.layers.elementwise_add(output, bias, act=activation)
return output
def gin(gw,
feature,
hidden_size,
activation,
name,
init_eps=0.0,
train_eps=False):
"""Implementation of Graph Isomorphism Network (GIN) layer.
This is an implementation of the paper How Powerful are Graph Neural Networks?
(https://arxiv.org/pdf/1810.00826.pdf).
In their implementation, all MLPs have 2 layers. Batch normalization is applied
on every hidden layer.
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, feature_size).
name: The GIN layer name.
hidden_size: The hidden size for gin.
activation: The activation for the output.
init_eps: float, optional
Initial :math:`\epsilon` value, default is 0.
train_eps: bool, optional
If True, :math:`\epsilon` will be a learnable parameter.
Return:
A tensor with shape (num_nodes, hidden_size).
"""
def send_src_copy(src_feat, dst_feat, edge_feat):
return src_feat["h"]
epsilon = fluid.layers.create_parameter(
shape=[1, 1],
dtype="float32",
attr=fluid.ParamAttr(name="%s_eps" % name),
default_initializer=fluid.initializer.ConstantInitializer(
value=init_eps))
if not train_eps:
epsilon.stop_gradient = True
msg = gw.send(send_src_copy, nfeat_list=[("h", feature)])
output = gw.recv(msg, "sum") + feature * (epsilon + 1.0)
output = fluid.layers.fc(output,
size=hidden_size,
act=None,
param_attr=fluid.ParamAttr(name="%s_w_0" % name),
bias_attr=fluid.ParamAttr(name="%s_b_0" % name))
output = fluid.layers.batch_norm(output)
output = getattr(fluid.layers, activation)(output)
output = fluid.layers.fc(output,
size=hidden_size,
act=activation,
param_attr=fluid.ParamAttr(name="%s_w_1" % name),
bias_attr=fluid.ParamAttr(name="%s_b_1" % name))
return output
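# A numpy sketch of the aggregation that ``gin`` performs before its MLP: for
# every node v, out[v] = (1 + eps) * h[v] + sum of h[u] over incoming edges
# (u, v). This mirrors ``gw.recv(msg, "sum") + feature * (epsilon + 1.0)``
# above; the toy arrays and the helper name are illustrative only.
def _gin_aggregation_sketch():
    import numpy as np
    num_nodes, dim, eps = 4, 3, 0.1
    h = np.random.rand(num_nodes, dim).astype("float32")
    edges = np.array([(0, 1), (2, 1), (3, 0)])  # (src, dst) pairs
    out = (1.0 + eps) * h
    for src, dst in edges:
        out[dst] += h[src]  # "sum" aggregation of messages copied from src
    return out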
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This package implements common layers to help building
graph neural networks.
"""
import paddle.fluid as fluid
from pgl import graph_wrapper
from pgl.utils import paddle_helper
from pgl.utils import op
__all__ = ['graph_pooling', 'graph_norm']
def graph_pooling(gw, node_feat, pool_type):
"""Implementation of graph pooling
This layer pools node features into graph-level features.
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
node_feat: A tensor with shape (num_nodes, feature_size).
pool_type: The type of pooling ("sum", "average", "min")
Return:
A tensor with shape (num_graph, feature_size)
"""
graph_feat = op.nested_lod_reset(node_feat, gw.graph_lod)
graph_feat = fluid.layers.sequence_pool(graph_feat, pool_type)
return graph_feat
def graph_norm(gw, feature):
"""Implementation of graph normalization
Reference Paper: BENCHMARKING GRAPH NEURAL NETWORKS
Each node feature is divided by sqrt(num_nodes) of the graph it belongs to.
Args:
gw: Graph wrapper object (:code:`StaticGraphWrapper` or :code:`GraphWrapper`)
feature: A tensor with shape (num_nodes, hidden_size)
Return:
A tensor with shape (num_nodes, hidden_size)
"""
nodes = fluid.layers.fill_constant(
[gw.num_nodes, 1], dtype="float32", value=1.0)
norm = graph_pooling(gw, nodes, pool_type="sum")
norm = fluid.layers.sqrt(norm)
feature_lod = op.nested_lod_reset(feature, gw.graph_lod)
norm = fluid.layers.sequence_expand_as(norm, feature_lod)
norm.stop_gradient = True
return feature_lod / norm
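# A numpy sketch of what ``graph_norm`` computes: every node feature row is
# divided by sqrt(num_nodes) of the graph the node belongs to. ``graph_lod``
# is assumed to be the cumulative node offsets of a batch of graphs, as used
# by ``GraphWrapper``; the toy values and the helper name are illustrative only.
def _graph_norm_sketch():
    import numpy as np
    feature = np.ones((5, 2), dtype="float32")      # 5 nodes from 2 graphs
    graph_lod = np.array([0, 2, 5], dtype="int64")  # graph 0: nodes 0-1, graph 1: nodes 2-4
    out = feature.copy()
    for i in range(len(graph_lod) - 1):
        start, end = graph_lod[i], graph_lod[i + 1]
        out[start:end] /= np.sqrt(float(end - start))
    return out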
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""This package implements common layers to help building pooling operators.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import paddle.fluid as F
import paddle.fluid.layers as L
import pgl
__all__ = ['Set2Set']
class Set2Set(object):
"""Implementation of set2set pooling operator.
This is an implementation of the paper ORDER MATTERS: SEQUENCE TO SEQUENCE
FOR SETS (https://arxiv.org/pdf/1511.06391.pdf).
"""
def __init__(self, input_dim, n_iters, n_layers):
"""
Args:
input_dim: hidden size of input data.
n_iters: number of set2set iterations.
n_layers: number of lstm layers.
"""
self.input_dim = input_dim
self.output_dim = 2 * input_dim
self.n_iters = n_iters
# this's set2set n_layers, lstm n_layers = 1
self.n_layers = n_layers
def forward(self, feat):
"""
Args:
feat: input feature with shape [batch, n_edges, dim].
Return:
output_feat: output feature of set2set pooling with shape [batch, 2*dim].
"""
seqlen = 1
h = L.fill_constant_batch_size_like(
feat, [1, self.n_layers, self.input_dim], "float32", 0)
h = L.transpose(h, [1, 0, 2])
c = h
# [seqlen, batch, dim]
q_star = L.fill_constant_batch_size_like(
feat, [1, seqlen, self.output_dim], "float32", 0)
q_star = L.transpose(q_star, [1, 0, 2])
for _ in range(self.n_iters):
# q [seqlen, batch, dim]
# h [layer, batch, dim]
q, h, c = L.lstm(
q_star,
h,
c,
seqlen,
self.input_dim,
self.n_layers,
is_bidirec=False)
# e [batch, seqlen, n_edges]
e = L.matmul(L.transpose(q, [1, 0, 2]), feat, transpose_y=True)
# alpha [batch, seqlen, n_edges]
alpha = L.softmax(e)
# readout [batch, seqlen, dim]
readout = L.matmul(alpha, feat)
readout = L.transpose(readout, [1, 0, 2])
# q_star [seqlen, batch, dim + dim]
q_star = L.concat([q, readout], -1)
return L.squeeze(q_star, [0])
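# A numpy sketch of one set2set iteration's readout: the query attends over the
# set elements with a softmax, and the weighted sum is concatenated back onto
# the query, matching the ``matmul`` / ``softmax`` / ``concat`` sequence in
# ``forward`` above. The toy shapes and the helper name are illustrative only.
def _set2set_readout_sketch():
    import numpy as np
    batch, n_elems, dim = 2, 4, 3
    feat = np.random.rand(batch, n_elems, dim)   # set elements
    q = np.random.rand(batch, 1, dim)            # current LSTM query

    e = np.matmul(q, feat.transpose(0, 2, 1))    # [batch, 1, n_elems]
    alpha = np.exp(e) / np.exp(e).sum(axis=-1, keepdims=True)  # attention weights
    readout = np.matmul(alpha, feat)             # [batch, 1, dim]
    q_star = np.concatenate([q, readout], axis=-1)  # [batch, 1, 2 * dim]
    return q_star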
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""redis_graph"""
import pgl
import redis
from redis import BlockingConnectionPool, StrictRedis
from redis._compat import b, unicode, bytes, long, basestring
from rediscluster.nodemanager import NodeManager
from rediscluster.crc import crc16
from collections import OrderedDict
import threading
import numpy as np
import time
import json
import pgl.graph as pgraph
import pickle as pkl
from pgl.utils.logger import log
import pgl.graph_kernel as graph_kernel
def encode(value):
"""
Return a bytestring representation of the value.
This method is copied from Redis' connection.py:Connection.encode
"""
if isinstance(value, bytes):
return value
elif isinstance(value, (int, long)):
value = b(str(value))
elif isinstance(value, float):
value = b(repr(value))
elif not isinstance(value, basestring):
value = unicode(value)
if isinstance(value, unicode):
value = value.encode('utf-8')
return value
def crc16_hash(data):
"""crc16_hash"""
return crc16(encode(data))
LUA_SCRIPT = """
math.randomseed(tonumber(ARGV[1]))
local function permute(tab, count, bucket_size)
local n = #tab / bucket_size
local o_ret = {}
local o_dict = {}
for i = 1, count do
local j = math.random(i, n)
o_ret[i] = string.sub(tab, (i - 1) * bucket_size + 1, i * bucket_size)
if j > count then
if o_dict[j] ~= nil then
o_ret[i], o_dict[j] = o_dict[j], o_ret[i]
else
o_dict[j], o_ret[i] = o_ret[i], string.sub(tab, (j - 1) * bucket_size + 1, j * bucket_size)
end
end
end
return table.concat(o_ret)
end
local bucket_size = 16
local ret = {}
local sample_size = tonumber(ARGV[2])
for i=1, #ARGV - 2 do
local tab = redis.call("HGET", KEYS[1], ARGV[i + 2])
if tab then
if #tab / bucket_size <= sample_size then
ret[i] = tab
else
ret[i] = permute(tab, sample_size, bucket_size)
end
else
ret[i] = tab
end
end
return ret
"""
class RedisCluster(object):
"""RedisCluster"""
def __init__(self, startup_nodes):
self.nodemanager = NodeManager(startup_nodes=startup_nodes)
self.nodemanager.initialize()
self.redis_worker = {}
for node, config in self.nodemanager.nodes.items():
rdp = BlockingConnectionPool(
host=config["host"], port=config["port"])
self.redis_worker[node] = {
"worker": StrictRedis(
connection_pool=rdp, decode_responses=False),
"type": config["server_type"]
}
def get(self, key):
"""get"""
slot = self.nodemanager.keyslot(key)
node = np.random.choice(self.nodemanager.slots[slot])
worker = self.redis_worker[node['name']]
if worker["type"] == "slave":
worker["worker"].execute_command("READONLY")
return worker["worker"].get(key)
def hmget(self, key, fields):
"""hmget"""
while True:
retry = 0
try:
slot = self.nodemanager.keyslot(key)
node = np.random.choice(self.nodemanager.slots[slot])
worker = self.redis_worker[node['name']]
if worker["type"] == "slave":
worker["worker"].execute_command("READONLY")
ret = worker["worker"].hmget(key, fields)
break
except Exception as e:
retry += 1
if retry > 5:
raise e
print("RETRY hmget after 1 sec. Retry Time %s" % retry)
time.sleep(1)
return ret
def hmget_sample(self, key, fields, sample):
"""hmget_sample"""
while True:
retry = 0
try:
slot = self.nodemanager.keyslot(key)
node = np.random.choice(self.nodemanager.slots[slot])
worker = self.redis_worker[node['name']]
if worker["type"] == "slave":
worker["worker"].execute_command("READONLY")
func = worker["worker"].register_script(LUA_SCRIPT)
ret = func(
keys=[key],
args=[np.random.randint(4294967295), sample] + fields)
break
except Exception as e:
retry += 1
if retry > 5:
raise e
print("RETRY hmget_sample after 1 sec. Retry Time %s" % retry)
time.sleep(1)
return ret
def hmget_sample_helper(rs, query, num_parts, sample_size):
"""hmget_sample_helper"""
buff = [b""] * len(query)
part_dict = {}
part_ind_dict = {}
for ind, q in enumerate(query):
part = crc16_hash(q) % num_parts
part = "part-%s" % part
if part not in part_dict:
part_dict[part] = []
part_ind_dict[part] = []
part_dict[part].append(q)
part_ind_dict[part].append(ind)
def worker(_key, _value, _buff, _rs, _part_ind_dict, _sample_size):
"""worker"""
response = _rs.hmget_sample(_key, _value, _sample_size)
for res, ind in zip(response, _part_ind_dict[_key]):
_buff[ind] = res
def hmget(_part_dict, _rs, _buff, _part_ind_dict, _sample_size):
"""hmget"""
key_value = list(_part_dict.items())
np.random.shuffle(key_value)
for key, value in key_value:
worker(key, value, _buff, _rs, _part_ind_dict, _sample_size)
hmget(part_dict, rs, buff, part_ind_dict, sample_size)
return buff
def hmget_helper(rs, query, num_parts):
"""hmget_helper"""
buff = [b""] * len(query)
part_dict = {}
part_ind_dict = {}
for ind, q in enumerate(query):
part = crc16_hash(q) % num_parts
part = "part-%s" % part
if part not in part_dict:
part_dict[part] = []
part_ind_dict[part] = []
part_dict[part].append(q)
part_ind_dict[part].append(ind)
def worker(_key, _value, _buff, _rs, _part_ind_dict):
"""worker"""
response = _rs.hmget(_key, _value)
for res, ind in zip(response, _part_ind_dict[_key]):
_buff[ind] = res
def hmget(_part_dict, _rs, _buff, _part_ind_dict):
"""hmget"""
key_value = list(_part_dict.items())
np.random.shuffle(key_value)
for key, value in key_value:
worker(key, value, _buff, _rs, _part_ind_dict)
hmget(part_dict, rs, buff, part_ind_dict)
return buff
class RedisGraph(pgraph.Graph):
"""RedisGraph"""
def __init__(self, name, redis_config, num_parts):
self._rs = RedisCluster(startup_nodes=redis_config)
self.num_parts = num_parts
self._name = name
self._num_nodes = None
self._num_edges = None
self._node_feat_info = None
self._edge_feat_info = None
self._node_feat_dtype = None
self._edge_feat_dtype = None
self._node_feat_shape = None
self._edge_feat_shape = None
@property
def num_nodes(self):
"""num_nodes"""
if self._num_nodes is None:
self._num_nodes = int(self._rs.get("num_nodes"))
return self._num_nodes
@property
def num_edges(self):
"""num_edges"""
if self._num_edges is None:
self._num_edges = int(self._rs.get("num_edges"))
return self._num_edges
def node_feat_info(self):
"""node_feat_info"""
if self._node_feat_info is None:
buff = self._rs.get("nf:infos")
self._node_feat_info = json.loads(buff.decode())
return self._node_feat_info
def node_feat_dtype(self, key):
"""node_feat_dtype"""
if self._node_feat_dtype is None:
self._node_feat_dtype = {}
for key, _, dtype in self.node_feat_info():
self._node_feat_dtype[key] = dtype
return self._node_feat_dtype[key]
def node_feat_shape(self, key):
"""node_feat_shape"""
if self._node_feat_shape is None:
self._node_feat_shape = {}
for key, shape, _ in self.node_feat_info():
self._node_feat_shape[key] = shape
return self._node_feat_shape[key]
def edge_feat_shape(self, key):
"""edge_feat_shape"""
if self._edge_feat_shape is None:
self._edge_feat_shape = {}
for key, shape, _ in self.edge_feat_info():
self._edge_feat_shape[key] = shape
return self._edge_feat_shape[key]
def edge_feat_dtype(self, key):
"""edge_feat_dtype"""
if self._edge_feat_dtype is None:
self._edge_feat_dtype = {}
for key, _, dtype in self.edge_feat_info():
self._edge_feat_dtype[key] = dtype
return self._edge_feat_dtype[key]
def edge_feat_info(self):
"""edge_feat_info"""
if self._edge_feat_info is None:
buff = self._rs.get("ef:infos")
self._edge_feat_info = json.loads(buff.decode())
return self._edge_feat_info
def sample_predecessor(self, nodes, max_degree, return_eids=False):
"""sample_predecessor"""
query = ["d:%s" % n for n in nodes]
rets = hmget_sample_helper(self._rs, query, self.num_parts, max_degree)
v = []
eid = []
for buff in rets:
if buff is None:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
else:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def sample_successor(self, nodes, max_degree, return_eids=False):
"""sample_successor"""
query = ["s:%s" % n for n in nodes]
rets = hmget_sample_helper(self._rs, query, self.num_parts, max_degree)
v = []
eid = []
for buff in rets:
if buff is None:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
else:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def predecessor(self, nodes, return_eids=False):
"""predecessor"""
query = ["d:%s" % n for n in nodes]
ret = hmget_helper(self._rs, query, self.num_parts)
v = []
eid = []
for buff in ret:
if buff is not None:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
else:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def successor(self, nodes, return_eids=False):
"""successor"""
query = ["s:%s" % n for n in nodes]
ret = hmget_helper(self._rs, query, self.num_parts)
v = []
eid = []
for buff in ret:
if buff is not None:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
else:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def get_edges_by_id(self, eids):
"""get_edges_by_id"""
queries = ["e:%s" % e for e in eids]
ret = hmget_helper(self._rs, queries, self.num_parts)
o = np.asarray(ret, dtype="int64")
dst = o % self.num_nodes
src = o // self.num_nodes
data = np.hstack(
[src.reshape([-1, 1]), dst.reshape([-1, 1])]).astype("int64")
return data
def get_node_feat_by_id(self, key, nodes):
"""get_node_feat_by_id"""
queries = ["nf:%s:%i" % (key, nid) for nid in nodes]
ret = hmget_helper(self._rs, queries, self.num_parts)
ret = b"".join(ret)
data = np.frombuffer(ret, dtype=self.node_feat_dtype(key))
data = data.reshape(self.node_feat_shape(key))
return data
def get_edge_feat_by_id(self, key, eids):
"""get_edge_feat_by_id"""
queries = ["ef:%s:%i" % (key, e) for e in eids]
ret = hmget_helper(self._rs, queries, self.num_parts)
ret = b"".join(ret)
data = np.frombuffer(ret, dtype=self.edge_feat_dtype(key))
data = data.reshape(self.edge_feat_shape(key))
return data
def subgraph(self, nodes, eid, edges=None):
"""Generate subgraph with nodes and edge ids.
This function will generate a :code:`pgl.graph.Subgraph` object and
copy all corresponding node and edge features. Nodes and edges will
be reindexed from 0.
WARNING: ALL NODES IN EID MUST BE INCLUDED BY NODES
Args:
nodes: Node ids which will be included in the subgraph.
eid: Edge ids which will be included in the subgraph.
Return:
A :code:`pgl.graph.Subgraph` object.
"""
reindex = {}
for ind, node in enumerate(nodes):
reindex[node] = ind
if edges is None:
edges = self.get_edges_by_id(eid)
else:
edges = np.array(edges, dtype="int64")
sub_edges = graph_kernel.map_edges(
np.arange(
len(edges), dtype="int64"), edges, reindex)
sub_edge_feat = {}
for key, _, _ in self.edge_feat_info():
sub_edge_feat[key] = self.get_edge_feat_by_id(key, eid)
sub_node_feat = {}
for key, _, _ in self.node_feat_info():
sub_node_feat[key] = self.get_node_feat_by_id(key, nodes)
subgraph = pgraph.SubGraph(
num_nodes=len(nodes),
edges=sub_edges,
node_feat=sub_node_feat,
edge_feat=sub_edge_feat,
reindex=reindex)
return subgraph
def node_batch_iter(self, batch_size, shuffle=True):
"""Node batch iterator
Iterate over all nodes by batch.
Args:
batch_size: The batch size of each batch of nodes.
shuffle: Whether to shuffle the nodes.
Return:
Batch iterator
"""
perm = np.arange(self.num_nodes, dtype="int64")
if shuffle:
np.random.shuffle(perm)
start = 0
while start < self._num_nodes:
yield perm[start:start + batch_size]
start += batch_size
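# A numpy sketch of the edge encoding that ``get_edges_by_id`` decodes above:
# each edge is stored as the single integer src * num_nodes + dst, so the pair
# is recovered with integer division and modulo. The toy values and the helper
# name are illustrative only.
def _edge_id_encoding_sketch():
    import numpy as np
    num_nodes = 100
    edges = np.array([(3, 7), (42, 99)], dtype="int64")
    encoded = edges[:, 0] * num_nodes + edges[:, 1]   # what would be stored
    src = encoded // num_nodes                        # what get_edges_by_id computes
    dst = encoded % num_nodes
    assert (np.stack([src, dst], axis=1) == edges).all()
    return encoded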
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""redis_hetergraph"""
import pgl
import redis
from redis import BlockingConnectionPool, StrictRedis
from redis._compat import b, unicode, bytes, long, basestring
from rediscluster.nodemanager import NodeManager
from rediscluster.crc import crc16
from collections import OrderedDict
import threading
import numpy as np
import time
import json
import pgl.graph as pgraph
import pickle as pkl
from pgl.utils.logger import log
import pgl.graph_kernel as graph_kernel
from pgl import heter_graph
import pgl.redis_graph as rg
class RedisHeterGraph(rg.RedisGraph):
"""Redis Heterogeneous Graph"""
def __init__(self, name, edge_types, redis_config, num_parts):
super(RedisHeterGraph, self).__init__(name, redis_config, num_parts)
self._num_edges = {}
self.edge_types = edge_types
self.e_type = None
self._edge_feat_info = {}
self._edge_feat_dtype = {}
self._edge_feat_shape = {}
def num_edges_by_type(self, e_type):
"""get edge number by specified edge type"""
if e_type not in self._num_edges:
self._num_edges[e_type] = int(
self._rs.get("%s:num_edges" % e_type))
return self._num_edges[e_type]
def num_edges(self):
"""num_edges"""
num_edges = {}
for e_type in self.edge_types:
num_edges[e_type] = self.num_edges_by_type(e_type)
return num_edges
def edge_feat_info_by_type(self, e_type):
"""get edge features information by specified edge type"""
if e_type not in self._edge_feat_info:
buff = self._rs.get("%s:ef:infos" % e_type)
if buff is not None:
self._edge_feat_info[e_type] = json.loads(buff.decode())
else:
self._edge_feat_info[e_type] = []
return self._edge_feat_info[e_type]
def edge_feat_info(self):
"""edge_feat_info"""
edge_feat_info = {}
for e_type in self.edge_types:
edge_feat_info[e_type] = self.edge_feat_info_by_type(e_type)
return edge_feat_info
def edge_feat_shape(self, e_type, key):
"""edge_feat_shape"""
if e_type not in self._edge_feat_shape:
e_feat_shape = {}
for k, shape, _ in self.edge_feat_info()[e_type]:
e_feat_shape[k] = shape
self._edge_feat_shape[e_type] = e_feat_shape
return self._edge_feat_shape[e_type][key]
def edge_feat_dtype(self, e_type, key):
"""edge_feat_dtype"""
if e_type not in self._edge_feat_dtype:
e_feat_dtype = {}
for k, _, dtype in self.edge_feat_info()[e_type]:
e_feat_dtype[k] = dtype
self._edge_feat_dtype[e_type] = e_feat_dtype
return self._edge_feat_dtype[e_type][key]
def sample_predecessor(self, e_type, nodes, max_degree, return_eids=False):
"""sample predecessor with the specified edge type"""
query = ["%s:d:%s" % (e_type, n) for n in nodes]
rets = rg.hmget_sample_helper(self._rs, query, self.num_parts,
max_degree)
v = []
eid = []
for buff in rets:
if buff is None:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
else:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def sample_successor(self, e_type, nodes, max_degree, return_eids=False):
"""sample successor with the specified edge type"""
query = ["%s:s:%s" % (e_type, n) for n in nodes]
rets = rg.hmget_sample_helper(self._rs, query, self.num_parts,
max_degree)
v = []
eid = []
for buff in rets:
if buff is None:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
else:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def predecessor(self, e_type, nodes, return_eids=False):
"""predecessor with the specified edge type"""
query = ["%s:d:%s" % (e_type, n) for n in nodes]
ret = rg.hmget_helper(self._rs, query, self.num_parts)
v = []
eid = []
for buff in ret:
if buff is not None:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
else:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def successor(self, e_type, nodes, return_eids=False):
"""successor with the specified edge type"""
query = ["%s:s:%s" % (e_type, n) for n in nodes]
ret = rg.hmget_helper(self._rs, query, self.num_parts)
v = []
eid = []
for buff in ret:
if buff is not None:
npret = np.frombuffer(
buff, dtype="int64").reshape([-1, 2]).astype("int64")
v.append(npret[:, 0])
eid.append(npret[:, 1])
else:
v.append(np.array([], dtype="int64"))
eid.append(np.array([], dtype="int64"))
if return_eids:
return np.array(v), np.array(eid)
else:
return np.array(v)
def get_edges_by_id(self, e_type, eids):
"""get_edges_by_id"""
queries = ["%s:e:%s" % (e_type, e) for e in eids]
ret = rg.hmget_helper(self._rs, queries, self.num_parts)
o = np.asarray(ret, dtype="int64")
dst = o % self.num_nodes
src = o // self.num_nodes
data = np.hstack(
[src.reshape([-1, 1]), dst.reshape([-1, 1])]).astype("int64")
return data
def get_edge_feat_by_id(self, e_type, key, eids):
"""get_edge_feat_by_id"""
queries = ["%s:ef:%s:%i" % (e_type, key, e) for e in eids]
ret = rg.hmget_helper(self._rs, queries, self.num_parts)
if ret is None:
return None
else:
ret = b"".join(ret)
data = np.frombuffer(ret, dtype=self.edge_feat_dtype(e_type, key))
data = data.reshape(self.edge_feat_shape(e_type, key))
return data
def get_node_types(self, nodes):
"""get_node_types """
queries = ["nt:%i" % n for n in nodes]
ret = rg.hmget_helper(self._rs, queries, self.num_parts)
node_types = []
for buff in ret:
if buff:
node_types.append(buff.decode())
else:
node_types = None
return node_types
def subgraph(self, nodes, eid, edges=None):
"""Generate heterogeneous subgraph with nodes and edge ids.
WARNING: ALL NODES IN EID MUST BE INCLUDED BY NODES
Args:
nodes: Node ids which will be included in the subgraph.
eid: Edge ids which will be included in the subgraph.
Return:
A :code:`pgl.heter_graph.Subgraph` object.
"""
reindex = {}
for ind, node in enumerate(nodes):
reindex[node] = ind
_node_types = self.get_node_types(nodes)
if _node_types is None:
node_types = None
else:
node_types = []
for idx, t in zip(nodes, _node_types):
node_types.append([reindex[idx], t])
if edges is None:
edges = {}
for e_type, eid_list in eid.items():
edges[e_type] = self.get_edges_by_id(e_type, eid_list)
sub_edges = {}
for e_type, edges_list in edges.items():
sub_edges[e_type] = graph_kernel.map_edges(
np.arange(
len(edges_list), dtype="int64"), edges_list, reindex)
sub_edge_feat = {}
for e_type, edge_feat_info in self.edge_feat_info().items():
type_edge_feat = {}
for key, _, _ in edge_feat_info:
type_edge_feat[key] = self.get_edge_feat_by_id(e_type, key,
eid)
sub_edge_feat[e_type] = type_edge_feat
sub_node_feat = {}
for key, _, _ in self.node_feat_info():
sub_node_feat[key] = self.get_node_feat_by_id(key, nodes)
subgraph = heter_graph.SubHeterGraph(
num_nodes=len(nodes),
edges=sub_edges,
node_types=node_types,
node_feat=sub_node_feat,
edge_feat=sub_edge_feat,
reindex=reindex)
return subgraph
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This package implement graph sampling algorithm.
"""
import time
import copy
import numpy as np
import pgl
from pgl.utils.logger import log
from pgl import graph_kernel
__all__ = [
'graphsage_sample', 'node2vec_sample', 'deepwalk_sample',
'metapath_randomwalk', 'pinsage_sample'
]
def traverse(item):
"""traverse the list or numpy"""
if isinstance(item, list) or isinstance(item, np.ndarray):
for i in iter(item):
for j in traverse(i):
yield j
else:
yield item
def flat_node_and_edge(nodes, eids, weights=None):
"""flatten the sub-lists to one list"""
nodes = list(set(traverse(nodes)))
eids = list(traverse(eids))
if weights is not None:
weights = list(traverse(weights))
return nodes, eids, weights
def edge_hash(src, dst):
"""edge_hash
"""
return src * 100000007 + dst
def graphsage_sample(graph, nodes, samples, ignore_edges=[]):
"""Implement of graphsage sample.
Reference paper: https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf.
Args:
graph: A pgl graph instance
nodes: Sample starting from nodes
samples: A list, number of neighbors in each layer
ignore_edges: list of edges (src, dst) to be ignored.
Return:
A list of subgraphs
"""
start = time.time()
num_layers = len(samples)
start_nodes = nodes
nodes = list(start_nodes)
eids, edges = [], []
nodes_set = set(nodes)
layer_nodes, layer_eids, layer_edges = [], [], []
ignore_edge_set = set([edge_hash(src, dst) for src, dst in ignore_edges])
for layer_idx in reversed(range(num_layers)):
if len(start_nodes) == 0:
layer_nodes = [nodes] + layer_nodes
layer_eids = [eids] + layer_eids
layer_edges = [edges] + layer_edges
continue
batch_pred_nodes, batch_pred_eids = graph.sample_predecessor(
start_nodes, samples[layer_idx], return_eids=True)
start = time.time()
last_nodes_set = nodes_set
nodes, eids = copy.copy(nodes), copy.copy(eids)
edges = copy.copy(edges)
nodes_set, eids_set = set(nodes), set(eids)
for srcs, dst, pred_eids in zip(batch_pred_nodes, start_nodes,
batch_pred_eids):
for src, eid in zip(srcs, pred_eids):
if edge_hash(src, dst) in ignore_edge_set:
continue
if eid not in eids_set:
eids.append(eid)
edges.append([src, dst])
eids_set.add(eid)
if src not in nodes_set:
nodes.append(src)
nodes_set.add(src)
layer_edges = [edges] + layer_edges
start_nodes = list(nodes_set - last_nodes_set)
layer_nodes = [nodes] + layer_nodes
layer_eids = [eids] + layer_eids
start = time.time()
# Find new nodes
feed_dict = {}
subgraphs = []
for i in range(num_layers):
subgraphs.append(
graph.subgraph(
nodes=layer_nodes[0], eid=layer_eids[i], edges=layer_edges[i]))
# only for this task
subgraphs[i].node_feat["index"] = np.array(
layer_nodes[0], dtype="int64")
return subgraphs
def alias_sample(size, alias, events):
"""Implement of alias sample.
Args:
size: Output shape.
alias: The alias table built by `alias_sample_build_table`.
events: The events table built by `alias_sample_build_table`.
Return:
samples: The generated random samples.
"""
rand_num = np.random.uniform(0.0, len(alias), size)
idx = rand_num.astype("int64")
uni = rand_num - idx
flags = (uni >= alias[idx])
idx[flags] = events[idx][flags]
return idx
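# A pure-python sketch of how an alias table can be built, to show what the
# ``alias`` / ``events`` arrays consumed by ``alias_sample`` above look like.
# It illustrates the standard alias method under that interface and is not the
# actual ``graph_kernel.alias_sample_build_table`` implementation.
def _build_alias_table_sketch(probs):
    """Return (alias, events) such that bucket i keeps event i with prob alias[i]."""
    import numpy as np
    n = len(probs)
    scaled = np.array(probs, dtype="float64") * n
    alias = np.ones(n, dtype="float64")    # acceptance threshold per bucket
    events = np.zeros(n, dtype="int64")    # fallback event per bucket
    small = [i for i, p in enumerate(scaled) if p < 1.0]
    large = [i for i, p in enumerate(scaled) if p >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        alias[s] = scaled[s]
        events[s] = l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    # Remaining buckets always keep their own index (alias stays 1.0).
    # Usage with ``alias_sample`` above:
    #   alias, events = _build_alias_table_sketch([0.1, 0.2, 0.7])
    #   samples = alias_sample([1000], alias, events)
    return alias, events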
def graph_alias_sample_table(graph, edge_weight_name):
"""Build alias sample table for weighted deepwalk.
Args:
graph: The input graph
edge_weight_name: The name of edge weight in edge_feat.
Return:
Alias sample tables for each node.
"""
edge_weight = graph.edge_feat[edge_weight_name]
_, eids_array = graph.successor(return_eids=True)
alias_array, events_array = [], []
for eids in eids_array:
probs = edge_weight[eids]
probs /= np.sum(probs)
alias, events = graph_kernel.alias_sample_build_table(probs)
alias_array.append(alias), events_array.append(events)
alias_array, events_array = np.array(alias_array), np.array(events_array)
return alias_array, events_array
def deepwalk_sample(graph, nodes, max_depth, alias_name=None,
events_name=None):
"""Implement of random walk.
This function get random walks path for given nodes and depth.
Args:
nodes: Walk starting from nodes
max_depth: Max walking depth
Return:
A list of walks.
"""
walk = []
# init
for node in nodes:
walk.append([node])
cur_walk_ids = np.arange(0, len(nodes))
cur_nodes = np.array(nodes)
for l in range(max_depth):
# select the walks not end
cur_succs = graph.successor(cur_nodes)
mask = [len(succ) > 0 for succ in cur_succs]
if np.any(mask):
cur_walk_ids = cur_walk_ids[mask]
cur_nodes = cur_nodes[mask]
cur_succs = cur_succs[mask]
else:
# stop when all nodes have no successor
break
if alias_name is not None and events_name is not None:
sample_index = [
alias_sample([1], graph.node_feat[alias_name][node],
graph.node_feat[events_name][node])[0]
for node in cur_nodes
]
else:
outdegree = [len(cur_succ) for cur_succ in cur_succs]
sample_index = np.floor(
np.random.rand(cur_succs.shape[0]) * outdegree).astype("int64")
nxt_cur_nodes = []
for s, ind, walk_id in zip(cur_succs, sample_index, cur_walk_ids):
walk[walk_id].append(s[ind])
nxt_cur_nodes.append(s[ind])
cur_nodes = np.array(nxt_cur_nodes)
return walk
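# A minimal usage sketch for ``deepwalk_sample``: uniform random walks on a
# small toy graph. It assumes ``pgl.graph.Graph`` can be built from just
# ``num_nodes`` and ``edges``, as elsewhere in this repository; the toy edges
# and the helper name are illustrative only.
def _deepwalk_usage_sketch():
    import pgl.graph as pgraph
    edges = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 1)]
    g = pgraph.Graph(num_nodes=4, edges=edges)
    # One walk per start node, each with at most 3 sampled steps.
    walks = deepwalk_sample(g, nodes=[0, 1, 2, 3], max_depth=3)
    return walks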
def node2vec_sample(graph, nodes, max_depth, p=1.0, q=1.0):
"""Implement of node2vec random walk.
Reference paper: https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf.
Args:
graph: A pgl graph instance
nodes: Walk starting from nodes
max_depth: Max walking depth
p: Return parameter
q: In-out parameter
Return:
A list of walks.
"""
if p == 1.0 and q == 1.0:
return deepwalk_sample(graph, nodes, max_depth)
walk = []
# init
for node in nodes:
walk.append([node])
cur_walk_ids = np.arange(0, len(nodes))
cur_nodes = np.array(nodes)
prev_nodes = np.array([-1] * len(nodes), dtype="int64")
prev_succs = np.array([[]] * len(nodes), dtype="int64")
for l in range(max_depth):
# select the walks not end
cur_succs = graph.successor(cur_nodes)
mask = [len(succ) > 0 for succ in cur_succs]
if np.any(mask):
cur_walk_ids = cur_walk_ids[mask]
cur_nodes = cur_nodes[mask]
prev_nodes = prev_nodes[mask]
prev_succs = prev_succs[mask]
cur_succs = cur_succs[mask]
else:
# stop when all nodes have no successor
break
num_nodes = cur_nodes.shape[0]
nxt_nodes = np.zeros(num_nodes, dtype="int64")
for idx, (
succ, prev_succ, walk_id, prev_node
) in enumerate(zip(cur_succs, prev_succs, cur_walk_ids, prev_nodes)):
sampled_succ = graph_kernel.node2vec_sample(succ, prev_succ,
prev_node, p, q)
walk[walk_id].append(sampled_succ)
nxt_nodes[idx] = sampled_succ
prev_nodes, prev_succs = cur_nodes, cur_succs
cur_nodes = nxt_nodes
return walk
def metapath_randomwalk(graph,
start_nodes,
metapath,
walk_length,
alias_name=None,
events_name=None):
"""Implementation of metapath random walk in heterogeneous graph.
Args:
graph: instance of pgl heterogeneous graph
start_nodes: start nodes to generate walks
metapath: meta path for sampling nodes.
e.g: "c2p-p2a-a2p-p2c"
walk_length: the walk length
Return:
a list of metapath walks.
"""
edge_types = metapath.split('-')
walk = []
for node in start_nodes:
walk.append([node])
cur_walk_ids = np.arange(0, len(start_nodes))
cur_nodes = np.array(start_nodes)
mp_len = len(edge_types)
for i in range(0, walk_length - 1):
g = graph[edge_types[i % mp_len]]
cur_succs = g.successor(cur_nodes)
mask = [len(succ) > 0 for succ in cur_succs]
if np.any(mask):
cur_walk_ids = cur_walk_ids[mask]
cur_nodes = cur_nodes[mask]
cur_succs = cur_succs[mask]
else:
# stop when all nodes have no successor
break
if alias_name is not None and events_name is not None:
sample_index = [
alias_sample([1], g.node_feat[alias_name][node],
g.node_feat[events_name][node])[0]
for node in cur_nodes
]
else:
outdegree = [len(cur_succ) for cur_succ in cur_succs]
sample_index = np.floor(
np.random.rand(cur_succs.shape[0]) * outdegree).astype("int64")
nxt_cur_nodes = []
for s, ind, walk_id in zip(cur_succs, sample_index, cur_walk_ids):
walk[walk_id].append(s[ind])
nxt_cur_nodes.append(s[ind])
cur_nodes = np.array(nxt_cur_nodes)
return walk
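# A minimal usage sketch for ``metapath_randomwalk`` on a toy heterogeneous
# graph built with ``pgl.heter_graph.HeterGraph``. The edge type names, the
# metapath string and the helper name below are illustrative only.
def _metapath_randomwalk_usage_sketch():
    from pgl import heter_graph
    node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
    edges = {
        'u2i': [(0, 1), (3, 2)],
        'i2u': [(1, 3), (2, 0)],
    }
    g = heter_graph.HeterGraph(
        num_nodes=4, edges=edges, node_types=node_types)
    # Walk user -> item -> user -> item ... following the metapath.
    walks = metapath_randomwalk(
        g, start_nodes=[0, 3], metapath='u2i-i2u', walk_length=5)
    return walks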
def random_walk_with_start_prob(graph, nodes, max_depth, proba=0.5):
"""Implement of random walk with the probability of returning the origin node.
This function get random walks path for given nodes and depth.
Args:
nodes: Walk starting from nodes
max_depth: Max walking depth
proba: the probability of returning to the origin node
Return:
A list of walks.
"""
walk = []
# init
for node in nodes:
walk.append([node])
walk_ids = np.arange(0, len(nodes))
cur_nodes = np.array(nodes)
nodes = np.array(nodes)
for l in range(max_depth):
# select the walks not end
if l >= 1:
return_proba = np.random.rand(cur_nodes.shape[0])
proba_mask = (return_proba < proba)
cur_nodes[proba_mask] = nodes[proba_mask]
outdegree = graph.outdegree(cur_nodes)
mask = (outdegree != 0)
if np.any(mask):
cur_walk_ids = walk_ids[mask]
outdegree = outdegree[mask]
else:
# no node has a successor at this step; skip it and possibly return to the origin node in the next loop
continue
succ = graph.successor(cur_nodes[mask])
sample_index = np.floor(
np.random.rand(outdegree.shape[0]) * outdegree).astype("int64")
nxt_cur_nodes = cur_nodes
for s, ind, walk_id in zip(succ, sample_index, cur_walk_ids):
walk[walk_id].append(s[ind])
nxt_cur_nodes[walk_id] = s[ind]
cur_nodes = np.array(nxt_cur_nodes)
return walk
def pinsage_sample(graph,
nodes,
samples,
top_k=10,
proba=0.5,
norm_bais=1.0,
ignore_edges=set()):
"""Implement of graphsage sample.
Reference paper: .
Args:
graph: A pgl graph instance
nodes: Sample starting from nodes
samples: A list, number of neighbors in each layer
top_k: select the top_k nodes by visit count to construct the edges
proba: the probability of returning to the origin node
norm_bais: the normalization bias added to the visit count
ignore_edges: list of edges (src, dst) to be ignored.
Return:
A list of subgraphs
"""
start = time.time()
num_layers = len(samples)
start_nodes = nodes
edges, weights = [], []
layer_nodes, layer_edges, layer_weights = [], [], []
ignore_edge_set = set([edge_hash(src, dst) for src, dst in ignore_edges])
for layer_idx in reversed(range(num_layers)):
if len(start_nodes) == 0:
layer_nodes = [nodes] + layer_nodes
layer_edges = [edges] + layer_edges
layer_weights = [weights] + layer_weights
continue
walks = random_walk_with_start_prob(
graph, start_nodes, samples[layer_idx], proba=proba)
walks = [walk[1:] for walk in walks]
pred_edges = []
pred_weights = []
pred_nodes = []
for node, walk in zip(start_nodes, walks):
walk_nodes = []
walk_weights = []
count_sum = 0
for random_walk_node in walk:
if len(ignore_edge_set) > 0 and random_walk_node != node and \
edge_hash(random_walk_node, node) in ignore_edge_set:
continue
walk_nodes.append(random_walk_node)
unique, counts = np.unique(walk_nodes, return_counts=True)
frequencies = np.asarray((unique, counts)).T
frequencies = frequencies[np.argsort(frequencies[:, 1])]
frequencies = frequencies[-1 * top_k:, :]
for random_walk_node, random_count in zip(
frequencies[:, 0].tolist(), frequencies[:, 1].tolist()):
pred_nodes.append(random_walk_node)
pred_edges.append((random_walk_node, node))
walk_weights.append(random_count)
count_sum += random_count
count_sum += len(walk_weights) * norm_bais
walk_weights = (np.array(walk_weights) + norm_bais) / (count_sum)
pred_weights.extend(walk_weights.tolist())
last_node_set = set(nodes)
nodes, edges, weights = flat_node_and_edge([nodes, pred_nodes], \
[edges, pred_edges], [weights, pred_weights])
layer_edges = [edges] + layer_edges
layer_weights = [weights] + layer_weights
layer_nodes = [nodes] + layer_nodes
start_nodes = list(set(nodes) - last_node_set)
start = time.time()
feed_dict = {}
subgraphs = []
for i in range(num_layers):
edge_feat_dict = {
"weight": np.array(
layer_weights[i], dtype='float32')
}
subgraphs.append(
graph.subgraph(
nodes=layer_nodes[0],
edges=layer_edges[i],
edge_feats=edge_feat_dict))
subgraphs[i].node_feat["index"] = np.array(
layer_nodes[0], dtype="int64")
return subgraphs
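# A numpy sketch of how ``pinsage_sample`` turns random-walk visit counts into
# top_k neighbors and normalized edge weights, mirroring the ``np.unique`` /
# ``argsort`` / ``norm_bais`` logic above. The toy walk and the helper name
# are illustrative only.
def _pinsage_weight_sketch():
    import numpy as np
    walk_nodes = [5, 7, 5, 9, 5, 7]   # nodes visited while walking from one start node
    top_k, norm_bais = 2, 1.0
    unique, counts = np.unique(walk_nodes, return_counts=True)
    order = np.argsort(counts)
    unique, counts = unique[order][-top_k:], counts[order][-top_k:]
    count_sum = counts.sum() + len(counts) * norm_bais
    weights = (counts + norm_bais) / count_sum   # normalized visit frequencies
    return list(zip(unique.tolist(), weights.tolist()))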
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test_alias_sample"""
import argparse
import time
import unittest
from collections import Counter
import numpy as np
from pgl.graph_kernel import alias_sample_build_table
from pgl.sample import alias_sample
class AliasSampleTest(unittest.TestCase):
"""AliasSampleTest
"""
def setUp(self):
pass
def test_speed(self):
"""test_speed
"""
num = 1000
size = [10240, 1, 5]
probs = np.random.uniform(0.0, 1.0, [num])
probs /= np.sum(probs)
start = time.time()
alias, events = alias_sample_build_table(probs)
for i in range(100):
alias_sample(size, alias, events)
alias_sample_time = time.time() - start
start = time.time()
for i in range(100):
np.random.choice(num, size, p=probs)
np_sample_time = time.time() - start
self.assertTrue(alias_sample_time < np_sample_time)
    def test_result(self):
"""test_result
"""
size = [450000]
num = 10
probs = np.arange(1, num).astype(np.float64)
probs /= np.sum(probs)
alias, events = alias_sample_build_table(probs)
ret = alias_sample(size, alias, events)
cnt = Counter(ret)
sort_cnt_keys = [x[1] for x in sorted(zip(cnt.values(), cnt.keys()))]
self.assertEqual(sort_cnt_keys, np.arange(0, num - 1).tolist())
if __name__ == '__main__':
unittest.main()
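# A hedged sketch of the alias-method API exercised by the test above: build
# the alias/event tables once from a probability vector, then draw arbitrarily
# shaped batches in O(1) per draw. The probability values are illustrative.
def _alias_sample_example():
    """Draw 1000 samples from a 4-way categorical distribution."""
    import numpy as np
    from pgl.graph_kernel import alias_sample_build_table
    from pgl.sample import alias_sample
    probs = np.array([0.1, 0.2, 0.3, 0.4])
    alias, events = alias_sample_build_table(probs)
    draws = alias_sample([1000], alias, events)
    # empirical frequencies should approach probs as the sample size grows
    return np.bincount(np.asarray(draws, dtype="int64"),
                       minlength=len(probs)) / 1000.0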
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file is for testing gin layer.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import unittest
import numpy as np
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.layers.conv import gin
from pgl import graph
from pgl import graph_wrapper
class GinTest(unittest.TestCase):
"""GinTest
"""
def test_gin(self):
"""test_gin
"""
np.random.seed(1)
hidden_size = 8
num_nodes = 10
edges = [(1, 4), (0, 5), (1, 9), (1, 8), (2, 8), (2, 5), (3, 6),
(3, 7), (3, 4), (3, 8)]
inver_edges = [(v, u) for u, v in edges]
edges.extend(inver_edges)
node_feat = {"feature": np.random.rand(10, 4).astype("float32")}
g = graph.Graph(num_nodes=num_nodes, edges=edges, node_feat=node_feat)
use_cuda = False
place = F.CUDAPlace(0) if use_cuda else F.CPUPlace()
prog = F.Program()
startup_prog = F.Program()
with F.program_guard(prog, startup_prog):
gw = graph_wrapper.GraphWrapper(
name='graph',
place=place,
node_feat=g.node_feat_info(),
edge_feat=g.edge_feat_info())
output = gin(gw,
gw.node_feat['feature'],
hidden_size=hidden_size,
activation='relu',
name='gin',
init_eps=1,
train_eps=True)
exe = F.Executor(place)
exe.run(startup_prog)
ret = exe.run(prog, feed=gw.to_feed(g), fetch_list=[output])
self.assertEqual(ret[0].shape[0], num_nodes)
self.assertEqual(ret[0].shape[1], hidden_size)
if __name__ == "__main__":
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test_hetergraph"""
import time
import unittest
import json
import os
import numpy as np
from pgl.sample import metapath_randomwalk
from pgl.graph import Graph
from pgl import heter_graph
class HeterGraphTest(unittest.TestCase):
"""HeterGraph test
"""
@classmethod
def setUpClass(cls):
np.random.seed(1)
edges = {}
# for test no successor
edges['c2p'] = [(1, 4), (0, 5), (1, 9), (1, 8), (2, 8), (2, 5), (3, 6),
(3, 7), (3, 4), (3, 8)]
edges['p2c'] = [(v, u) for u, v in edges['c2p']]
edges['p2a'] = [(4, 10), (4, 11), (4, 12), (4, 14), (4, 13), (6, 12),
(6, 11), (6, 14), (7, 12), (7, 11), (8, 14), (9, 10)]
edges['a2p'] = [(v, u) for u, v in edges['p2a']]
# for test speed
# edges['c2p'] = [(0, 4), (0, 5), (1, 9), (1,8), (2,8), (2,5), (3,6), (3,7), (3,4), (3,8)]
# edges['p2c'] = [(v,u) for u, v in edges['c2p']]
# edges['p2a'] = [(4,10), (4,11), (4,12), (4,14), (5,13), (6,13), (6,11), (6,14), (7,12), (7,11), (8,14), (9,13)]
# edges['a2p'] = [(v,u) for u, v in edges['p2a']]
node_types = ['c' for _ in range(4)] + ['p' for _ in range(6)
] + ['a' for _ in range(5)]
node_types = [(i, t) for i, t in enumerate(node_types)]
cls.graph = heter_graph.HeterGraph(
num_nodes=len(node_types), edges=edges, node_types=node_types)
def test_num_nodes_by_type(self):
print()
n_types = {'c': 4, 'p': 6, 'a': 5}
for nt in n_types:
num_nodes = self.graph.num_nodes_by_type(nt)
self.assertEqual(num_nodes, n_types[nt])
def test_node_batch_iter(self):
print()
batch_size = 2
ground = [[4, 5], [6, 7], [8, 9]]
for idx, nodes in enumerate(
self.graph.node_batch_iter(
batch_size=batch_size, shuffle=False, n_type='p')):
self.assertEqual(len(nodes), batch_size)
self.assertListEqual(list(nodes), ground[idx])
def test_sample_successor(self):
print()
nodes = [4, 5, 8]
md = 2
succes = self.graph.sample_successor(
edge_type='p2a', nodes=nodes, max_degree=md, return_eids=False)
self.assertIsInstance(succes, list)
ground = [[10, 11, 12, 14, 13], [], [14]]
for succ, g in zip(succes, ground):
self.assertIsInstance(succ, np.ndarray)
for i in succ:
self.assertIn(i, g)
nodes = [4]
succes = self.graph.sample_successor(
edge_type='p2a', nodes=nodes, max_degree=md, return_eids=False)
self.assertIsInstance(succes, list)
ground = [[10, 11, 12, 14, 13]]
for succ, g in zip(succes, ground):
self.assertIsInstance(succ, np.ndarray)
for i in succ:
self.assertIn(i, g)
def test_successor(self):
print()
nodes = [4, 5, 8]
e_type = 'p2a'
succes = self.graph.successor(
edge_type=e_type,
nodes=nodes, )
self.assertIsInstance(succes, np.ndarray)
ground = [[10, 11, 12, 14, 13], [], [14]]
for succ, g in zip(succes, ground):
self.assertIsInstance(succ, np.ndarray)
self.assertCountEqual(succ, g)
nodes = [4]
e_type = 'p2a'
succes = self.graph.successor(
edge_type=e_type,
nodes=nodes, )
self.assertIsInstance(succes, np.ndarray)
ground = [[10, 11, 12, 14, 13]]
for succ, g in zip(succes, ground):
self.assertIsInstance(succ, np.ndarray)
self.assertCountEqual(succ, g)
def test_sample_nodes(self):
print()
p_ground = [4, 5, 6, 7, 8, 9]
sample_num = 10
nodes = self.graph.sample_nodes(sample_num=sample_num, n_type='p')
self.assertEqual(len(nodes), sample_num)
for n in nodes:
self.assertIn(n, p_ground)
# test n_type == None
ground = [i for i in range(15)]
nodes = self.graph.sample_nodes(sample_num=sample_num, n_type=None)
self.assertEqual(len(nodes), sample_num)
for n in nodes:
self.assertIn(n, ground)
if __name__ == "__main__":
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test_metapath_randomwalk"""
import time
import unittest
import json
import os
import numpy as np
from pgl.sample import metapath_randomwalk
from pgl.graph import Graph
from pgl import heter_graph
np.random.seed(1)
class MetapathRandomwalkTest(unittest.TestCase):
"""metapath_randomwalk test
"""
def setUp(self):
edges = {}
# for test no successor
edges['c2p'] = [(1, 4), (0, 5), (1, 9), (1, 8), (2, 8), (2, 5), (3, 6),
(3, 7), (3, 4), (3, 8)]
edges['p2c'] = [(v, u) for u, v in edges['c2p']]
edges['p2a'] = [(4, 10), (4, 11), (4, 12), (4, 14), (4, 13), (6, 12),
(6, 11), (6, 14), (7, 12), (7, 11), (8, 14), (9, 10)]
edges['a2p'] = [(v, u) for u, v in edges['p2a']]
# for test speed
# edges['c2p'] = [(0, 4), (0, 5), (1, 9), (1,8), (2,8), (2,5), (3,6), (3,7), (3,4), (3,8)]
# edges['p2c'] = [(v,u) for u, v in edges['c2p']]
# edges['p2a'] = [(4,10), (4,11), (4,12), (4,14), (5,13), (6,13), (6,11), (6,14), (7,12), (7,11), (8,14), (9,13)]
# edges['a2p'] = [(v,u) for u, v in edges['p2a']]
self.node_types = ['c' for _ in range(4)] + [
'p' for _ in range(6)
] + ['a' for _ in range(5)]
node_types = [(i, t) for i, t in enumerate(self.node_types)]
self.graph = heter_graph.HeterGraph(
num_nodes=len(node_types), edges=edges, node_types=node_types)
def test_metapath_randomwalk(self):
meta_path = 'c2p-p2a-a2p-p2c'
path = ['c', 'p', 'a', 'p', 'c']
start_nodes = [0, 1, 2, 3]
walk_len = 10
walks = metapath_randomwalk(
graph=self.graph,
start_nodes=start_nodes,
metapath=meta_path,
walk_length=walk_len)
self.assertEqual(len(walks), 4)
for walk in walks:
for i in range(len(walk)):
idx = i % (len(path) - 1)
self.assertEqual(self.node_types[walk[i]], path[idx])
if __name__ == "__main__":
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""test_redis_graph"""
import time
import unittest
import json
import os
import numpy as np
from pgl.redis_graph import RedisGraph
class RedisGraphTest(unittest.TestCase):
"""RedisGraphTest
"""
def setUp(self):
config_path = os.path.join(
os.path.abspath(os.path.dirname(__file__)),
'test_redis_graph_conf.json')
with open(config_path) as inf:
config = json.load(inf)
redis_configs = [config["redis"], ]
self.graph = RedisGraph(
"reddit-graph", redis_configs, num_parts=config["num_parts"])
def test_random_seed(self):
"""test_random_seed
"""
np.random.seed(1)
data1 = self.graph.sample_predecessor(range(1000), max_degree=5)
data1 = [nid for nodes in data1 for nid in nodes]
np.random.seed(1)
data2 = self.graph.sample_predecessor(range(1000), max_degree=5)
data2 = [nid for nodes in data2 for nid in nodes]
np.random.seed(3)
data3 = self.graph.sample_predecessor(range(1000), max_degree=5)
data3 = [nid for nodes in data3 for nid in nodes]
self.assertEqual(data1, data2)
self.assertNotEqual(data2, data3)
if __name__ == '__main__':
unittest.main()
{
"redis":
{
"host": "10.86.54.13",
"port": "7003"
},
"num_parts": 64
}
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This package implement graph sampling algorithm.
"""
import unittest
import os
import json
import numpy as np
from pgl.redis_graph import RedisGraph
from pgl.sample import graphsage_sample
from pgl.sample import node2vec_sample
class SampleTest(unittest.TestCase):
"""SampleTest
"""
def setUp(self):
config_path = os.path.join(
os.path.abspath(os.path.dirname(__file__)),
'test_redis_graph_conf.json')
with open(config_path) as inf:
config = json.load(inf)
redis_configs = [config["redis"], ]
self.graph = RedisGraph(
"reddit-graph", redis_configs, num_parts=config["num_parts"])
def test_graphsage_sample(self):
"""test_graphsage_sample
"""
eids = np.random.choice(self.graph.num_edges, 1000)
edges = self.graph.get_edges_by_id(eids)
nodes = [n for edge in edges for n in edge]
ignore_edges = edges.tolist() + edges[:, [1, 0]].tolist()
np.random.seed(1)
subgraphs = graphsage_sample(self.graph, nodes, [10, 10], [])
np.random.seed(1)
subgraphs_ignored = graphsage_sample(self.graph, nodes, [10, 10],
ignore_edges)
self.assertEqual(subgraphs[0].num_nodes,
subgraphs_ignored[0].num_nodes)
self.assertGreaterEqual(subgraphs[0].num_edges,
subgraphs_ignored[0].num_edges)
def test_node2vec_sample(self):
"""test_node2vec_sample
"""
walks = node2vec_sample(self.graph, range(10), 3)
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
This file is for testing the Set2Set layer.
"""
from __future__ import division
from __future__ import absolute_import
from __future__ import print_function
from __future__ import unicode_literals
import unittest
import paddle.fluid as F
import paddle.fluid.layers as L
from pgl.layers.set2set import Set2Set
def paddle_easy_run(model_func, data):
prog = F.Program()
startup_prog = F.Program()
with F.program_guard(prog, startup_prog):
ret = model_func()
place = F.CUDAPlace(0)
exe = F.Executor(place)
exe.run(startup_prog)
return exe.run(prog, fetch_list=ret, feed=data)
class Set2SetTest(unittest.TestCase):
"""Set2SetTest
"""
    def test_set2set(self):
        """test_set2set
        """
import numpy as np
def model_func():
s2s = Set2Set(5, 1, 3)
h0 = L.data(
name='h0',
shape=[2, 10, 5],
dtype='float32',
append_batch_size=False)
h1 = s2s.forward(h0)
return h1,
data = {"h0": np.random.rand(2, 10, 5).astype("float32")}
h1, = paddle_easy_run(model_func, data)
self.assertEqual(h1.shape[0], 2)
self.assertEqual(h1.shape[1], 10)
if __name__ == "__main__":
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multiprocessing Reader for PaddlePaddle
"""
import logging
log = logging.getLogger(__name__)
import multiprocessing
import copy
try:
import ujson as json
except ImportError:
    log.info("ujson is not installed, falling back to the standard json module")
    import json
import numpy as np
import time
import paddle.fluid as fluid
from queue import Queue
import threading
def serialize_data(data):
"""serialize_data"""
if data is None:
return None
    return numpy_serialize_data(data)
def numpy_serialize_data(data):
"""serialize_data"""
ret_data = {}
for key in data:
if isinstance(data[key], np.ndarray):
ret_data[key] = (data[key].tobytes(), list(data[key].shape),
"%s" % data[key].dtype)
else:
ret_data[key] = data[key]
return ret_data
def numpy_deserialize_data(data):
"""deserialize_data"""
if data is None:
return None
for key in data:
if isinstance(data[key], tuple):
value = np.frombuffer(
data[key][0], dtype=data[key][2]).reshape(data[key][1])
data[key] = value
return data
def deserialize_data(data):
"""deserialize_data"""
return numpy_deserialize_data(data)
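# A round-trip sketch for the two helpers above (illustrative only): ndarray
# values are flattened to (bytes, shape, dtype) tuples so samples pickle
# cheaply across processes, and deserialize_data rebuilds them via np.frombuffer.
def _serialize_roundtrip_example():
    """Serialize a toy sample dict and restore it."""
    sample = {"feat": np.arange(6, dtype="float32").reshape(2, 3), "label": 1}
    restored = deserialize_data(serialize_data(sample))
    assert restored["label"] == 1
    assert restored["feat"].shape == (2, 3)
    return restored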
def multiprocess_reader(readers, use_pipe=True, queue_size=1000, pipe_size=10):
"""
multiprocess_reader use python multi process to read data from readers
and then use multiprocess.Queue or multiprocess.Pipe to merge all
data. The process number is equal to the number of input readers, each
process call one reader.
Multiprocess.Queue require the rw access right to /dev/shm, some
platform does not support.
you need to create multiple readers first, these readers should be independent
to each other so that each process can work independently.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader1 = reader(["file21", "file22"])
reader = multiprocess_reader([reader0, reader1, reader2],
queue_size=100, use_pipe=False)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(serialize_data(sample))
queue.put(serialize_data(None))
def queue_reader():
"""queue_reader"""
queue = multiprocessing.Queue(queue_size)
for reader in readers:
p = multiprocessing.Process(
target=_read_into_queue, args=(reader, queue))
p.start()
reader_num = len(readers)
finish_num = 0
while finish_num < reader_num:
sample = deserialize_data(queue.get())
if sample is None:
finish_num += 1
else:
yield sample
def _read_into_pipe(reader, conn, max_pipe_size):
"""read_into_pipe"""
for sample in reader():
if sample is None:
raise ValueError("sample has None!")
conn.send(serialize_data(sample))
conn.send(serialize_data(None))
conn.close()
def pipe_reader():
"""pipe_reader"""
conns = []
for reader in readers:
parent_conn, child_conn = multiprocessing.Pipe()
conns.append(parent_conn)
p = multiprocessing.Process(
target=_read_into_pipe, args=(reader, child_conn, pipe_size))
p.start()
reader_num = len(readers)
conn_to_remove = []
finish_flag = np.zeros(len(conns), dtype="int32")
start = time.time()
def queue_worker(sub_conn, que):
while True:
buff = sub_conn.recv()
sample = deserialize_data(buff)
if sample is None:
que.put(None)
sub_conn.close()
break
que.put(sample)
thread_pool = []
output_queue = Queue(maxsize=reader_num)
for i in range(reader_num):
t = threading.Thread(
target=queue_worker, args=(conns[i], output_queue))
t.daemon = True
t.start()
thread_pool.append(t)
finish_num = 0
while finish_num < reader_num:
sample = output_queue.get()
if sample is None:
finish_num += 1
else:
yield sample
for thread in thread_pool:
thread.join()
if use_pipe:
return pipe_reader
else:
return queue_reader
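# A hedged usage sketch for multiprocess_reader: two toy readers producing
# dict-of-ndarray samples are merged through pipes. The sample schema is
# illustrative; the __main__ guard matters because the workers run in
# separate processes.
if __name__ == "__main__":

    def _make_toy_reader(seed):
        """Build a reader that yields a few random feature dicts."""

        def _reader():
            rng = np.random.RandomState(seed)
            for _ in range(3):
                yield {"feat": rng.rand(4).astype("float32")}

        return _reader

    merged = multiprocess_reader(
        [_make_toy_reader(0), _make_toy_reader(1)], use_pipe=True)
    for item in merged():
        print(item["feat"].shape)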
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Optimized Multithreading Reader for PaddlePaddle
"""
import logging
log = logging.getLogger(__name__)
import threading
import queue
import copy
import numpy as np
import time
import paddle.fluid as fluid
def multithreading_reader(readers, queue_size=1000):
"""
multithreading_reader use python multi thread to read data from readers
and then use queue to merge all
data. The process number is equal to the number of input readers, each
process call one reader.
CPU usage rate won't go over 100% with GIL.
you need to create multiple readers first, these readers should be independent
to each other so that each process can work independently.
An example:
.. code-block:: python
reader0 = reader(["file01", "file02"])
reader1 = reader(["file11", "file12"])
reader1 = reader(["file21", "file22"])
reader = multithreading_reader([reader0, reader1, reader2],
queue_size=100)
"""
assert type(readers) is list and len(readers) > 0
def _read_into_queue(reader, queue):
"""read_into_queue"""
for sample in reader():
if sample is None:
raise ValueError("sample has None")
queue.put(sample)
queue.put(None)
def queue_reader():
"""queue_reader"""
output_queue = queue.Queue(queue_size)
thread_pool = []
thread_num = 0
for reader in readers:
p = threading.Thread(
target=_read_into_queue, args=(reader, output_queue))
p.daemon = True
p.start()
thread_pool.append(p)
thread_num += 1
while True:
ret = output_queue.get()
if ret is not None:
yield ret
else:
thread_num -= 1
if thread_num == 0:
break
for thread in thread_pool:
thread.join()
return queue_reader
......@@ -225,3 +225,23 @@ def scatter_add(input, index, updates):
output = fluid.layers.scatter(input, index, updates, overwrite=False)
return output
def scatter_max(input, index, updates):
"""Scatter max updates to input by given index.
Adds sparse updates to input variables.
Args:
input: Input tensor to be updated
index: Slice index
updates: Must have same type as input.
Return:
Same type and shape as input.
"""
output = fluid.layers.scatter(input, index, updates, mode='max')
return output
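# A hedged dygraph sketch for scatter_max (illustrative only): it mirrors the
# scatter_add test elsewhere in this repo and assumes the patched
# fluid.layers.scatter that accepts mode='max'; stock PaddlePaddle releases
# may not expose that flag.
def _scatter_max_example():
    """Reduce two sparse updates into row 1 of a 2x2 tensor with max."""
    import numpy as np
    with fluid.dygraph.guard(fluid.CPUPlace()):
        base = fluid.dygraph.to_variable(
            np.array([[1, 2], [5, 6]], dtype='float32'))
        idx = fluid.dygraph.to_variable(np.array([1, 1], dtype=np.int32))
        upd = fluid.dygraph.to_variable(
            np.array([[3, 4], [7, 4]], dtype='float32'))
        return scatter_max(base, idx, upd).numpy()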
......@@ -16,10 +16,35 @@ import os
import sys
import re
import codecs
import numpy as np
from setuptools import setup, find_packages
from setuptools.extension import Extension
from Cython.Build import cythonize
from setuptools import Extension
from setuptools import dist
from setuptools.command.build_ext import build_ext as _build_ext
try:
from Cython.Build import cythonize
except ImportError:
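    # Cython is not importable when setup.py is first loaded; defer the real
    # import until build_ext actually runs, by which time setup_requires has
    # pulled Cython in (same idea as the numpy handling in CustomBuildExt).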
def cythonize(*args, **kwargs):
"""cythonize"""
from Cython.Build import cythonize
return cythonize(*args, **kwargs)
class CustomBuildExt(_build_ext):
"""CustomBuildExt"""
def finalize_options(self):
_build_ext.finalize_options(self)
# Prevent numpy from thinking it is still in its setup process:
__builtins__.__NUMPY_SETUP__ = False
import numpy
self.include_dirs.append(numpy.get_include())
workdir = os.path.dirname(os.path.abspath(__file__))
with open(os.path.join(workdir, './requirements.txt')) as f:
requirements = f.read().splitlines()
cur_dir = os.path.abspath(os.path.dirname(__file__))
with open(os.path.join(cur_dir, 'README.md'), 'rb') as f:
......@@ -58,7 +83,6 @@ extensions = [
"pgl.graph_kernel",
["pgl/graph_kernel.pyx"],
language="c++",
include_dirs=[np.get_include()],
extra_compile_args=compile_extra_args,
extra_link_args=link_extra_args, ),
]
......@@ -66,7 +90,6 @@ extensions = [
def get_package_data(path):
files = []
print(path)
for root, dirnames, filenames in os.walk(path):
for filename in filenames:
files.append(os.path.join(root, filename))
......@@ -83,9 +106,16 @@ setup(
long_description_content_type='text/markdown',
url="https://github.com/PaddlePaddle/PGL",
package_data=package_data,
setup_requires=[
'setuptools>=18.0',
'numpy>=1.16.4',
],
install_requires=requirements,
cmdclass={'build_ext': CustomBuildExt},
packages=find_packages(),
include_package_data=True,
ext_modules=cythonize(extensions),
#ext_modules=cythonize(extensions),
ext_modules=extensions,
classifiers=[
'Intended Audience :: Developers',
'License :: OSI Approved :: Apache Software License',
......
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""scatter test cases"""
import unittest
import numpy as np
import paddle.fluid as fluid
class ScatterAddTest(unittest.TestCase):
"""ScatterAddTest"""
def test_scatter_add(self):
"""test_scatter_add"""
with fluid.dygraph.guard(fluid.CPUPlace()):
input = fluid.dygraph.to_variable(
np.array(
[[1, 2], [5, 6]], dtype='float32'), )
index = fluid.dygraph.to_variable(np.array([1, 1], dtype=np.int32))
updates = fluid.dygraph.to_variable(
np.array(
[[3, 4], [3, 4]], dtype='float32'), )
output = fluid.layers.scatter(input, index, updates, mode='add')
assert output.numpy().tolist() == [[1, 2], [11, 14]]
if __name__ == '__main__':
unittest.main()
# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""unique with counts test"""
import unittest
import numpy as np
import paddle.fluid as fluid
class UniqueWithCountTest(unittest.TestCase):
"""UniqueWithCountTest"""
def _test_unique_with_counts_helper(self, input, output):
place = fluid.CPUPlace()
exe = fluid.Executor(place)
main_program = fluid.Program()
startup_program = fluid.Program()
with fluid.program_guard(main_program, startup_program):
x = fluid.layers.data(
name='input',
dtype='int64',
shape=[-1],
append_batch_size=False)
#x = fluid.assign(np.array([2, 3, 3, 1, 5, 3], dtype='int32'))
out, index, count = fluid.layers.unique_with_counts(x)
out, index, count = exe.run(
main_program,
feed={'input': np.array(
input, dtype='int64'), },
fetch_list=[out, index, count],
return_numpy=True, )
out, index, count = out.tolist(), index.tolist(), count.tolist()
assert [out, index, count] == output
def test_unique_with_counts(self):
"""test_unique_with_counts"""
self._test_unique_with_counts_helper(
input=[1, 1, 2, 4, 4, 4, 7, 8, 8],
output=[
[1, 2, 4, 7, 8],
[0, 0, 1, 2, 2, 2, 3, 4, 4],
[2, 1, 3, 1, 2],
], )
self._test_unique_with_counts_helper(
input=[1],
output=[
[1],
[0],
[1],
], )
self._test_unique_with_counts_helper(
input=[1, 1],
output=[
[1],
[0, 0],
[2],
], )
if __name__ == '__main__':
unittest.main()
......@@ -145,7 +145,7 @@
"source": [
"import paddle.fluid as fluid\n",
"use_cuda = False \n",
"place = fluid.GPUPlace(0) if use_cuda else fluid.CPUPlace()\n",
"place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()\n",
"\n",
"gw = pgl.graph_wrapper.GraphWrapper(name='graph',\n",
" place = place,\n",
......