diff --git a/README.md b/README.md index 3a58bf618ebcde862336f281b21ca5078dfb7ef0..9c27cbb9ff798fa2a142bbfdb4cd238f75035daa 100644
--- a/README.md
+++ b/README.md
@@ -1,17 +1,17 @@
-# Paddle Graph Learning (PGL)
+The logo of Paddle Graph Learning (PGL)

[DOC](https://pgl.readthedocs.io/en/latest/) | [Quick Start](https://pgl.readthedocs.io/en/latest/instruction.html) | [中文](./README.zh.md)

-Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
+Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).

The Framework of Paddle Graph Learning (PGL)

-We provide python interfaces for storing/reading/querying graph structured data and two fundamental computational interfaces, which are walk based paradigm and message-passing based paradigm as shown in the above framework of PGL, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
+The newly released PGL supports heterogeneous graph learning on both the walk-based paradigm and the message-passing-based paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and several distributed training algorithms, such as distributed DeepWalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.

-## Highlight: Efficient and Flexible Message Passing Paradigm
+## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing

One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. At PGL we adopt a **Message Passing Paradigm** similar to [DGL](https://github.com/dmlc/dgl) to help build a custom graph neural network easily. Users only need to write ```send``` and ```recv``` functions to easily implement a simple GCN. As shown in the following figure, for the first step the send function is defined on the edges of the graph, and the user can customize the send function ![](http://latex.codecogs.com/gif.latex?\\phi^e}) to send the message from the source to the target node. For the second step, the recv function ![](http://latex.codecogs.com/gif.latex?\\phi^v}) is responsible for aggregating ![](http://latex.codecogs.com/gif.latex?\\oplus}) messages together from different sources.

@@ -19,7 +19,7 @@ One of the most important benefits of graph neural networks compared to other mo

The basic idea of message passing paradigm

-As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then apply an aggregate function ![](http://latex.codecogs.com/gif.latex?\\oplus}) on each batch serially.
For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) taking the message as variable length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
+As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then apply an aggregate function ![](http://latex.codecogs.com/gif.latex?\\oplus}) on each batch serially. For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) taking the message as variable length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.

The parallel degree bucketing of PGL

@@ -35,10 +35,10 @@ Users only need to call the ```sequence_ops``` functions provided by Paddle to e

Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** the degree bucketing algorithm executes each degree bucket serially and cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.

-## Performance
+### Performance

-We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping.
+We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.

| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
@@ -49,6 +49,7 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|

+
+If we use complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message-passing in DGL will be forced to degenerate into the degree bucketing scheme, and the speed will be much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL can reach up to 13 times the speed of DGL.
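+
+To make the comparison concrete, an LSTM-style aggregation is just another user-defined ``recv`` function over the LodTensor message. The following is only an illustrative sketch (the hidden size and the helper name are assumptions, not the code used in the benchmark):
+
+```python
+import paddle.fluid as fluid
+
+def lstm_recv(msg):
+    # treat the messages arriving at each node as one variable-length
+    # sequence (a LodTensor) and run an LSTM over it; 128 is an assumed size
+    hidden_size = 128
+    proj = fluid.layers.fc(input=msg, size=hidden_size * 4)
+    hidden, _ = fluid.layers.dynamic_lstm(input=proj, size=hidden_size * 4)
+    # keep the last LSTM state of each sequence as the aggregated message
+    return fluid.layers.sequence_pool(hidden, "last")
+```
+
+Under degree bucketing, such a function must be run once per degree bucket; over a LodTensor it runs once for all sequences in parallel, which is where the speed-up reported in the following table comes from.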
| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
@@ -57,7 +58,49 @@ If we use complex user-defined aggregation like [GraphSAGE-LSTan
| Pubmed | **0.0388s** |0.5275s | 13.59x|
| Citeseer | **0.0150s** | 0.1278s | 8.52x |

+## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
+
+Graphs can conveniently represent the relations between things in the real world, but the categories of things and the relations between them are various. Therefore, in a heterogeneous graph, we need to distinguish the node types and edge types in the graph network. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
+
+### Support meta path walk sampling on heterogeneous graph
+
+The metapath sampling in heterogeneous graph
+The left side of the figure above describes a shopping social network. Its nodes fall into two categories, users and products, and its relations cover user-user, user-product, and product-product. The right side of the figure shows a simple MetaPath sampling process: given the metapath UPU (user-product-user), sampling yields walks like the following
+The metapath result
+On this basis, methods such as word2vec can then be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
+
+### Support Message Passing mechanism on heterogeneous graph
+
+The message passing mechanism on heterogeneous graph
+Because nodes in a heterogeneous graph have different types, message passing between them also differs by type. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, during message passing the neighbors of different types are first aggregated separately and then merged into the final message that updates the target node. On this basis, PGL supports heterogeneous graph algorithms based on message passing, such as GATNE.
+
+
+## Large-Scale: Support distributed graph storage and distributed training algorithms
+In most cases of large-scale graph learning, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so that users can easily set up a large-scale distributed training algorithm on MPI clusters.
+
+The distributed frame of PGL
+
+
+## Highlight: Tons of Models
+The following are 13 graph learning models that have been implemented in the framework.
See the details [here](https://pgl.readthedocs.io/en/latest/introduction.html#highlight-tons-of-models)
+
+|Model | feature |
+|---|---|
+| GCN | Graph Convolutional Neural Networks |
+| GAT | Graph Attention Network |
+| GraphSage |Large-scale graph convolution network based on neighborhood sampling|
+| unSup-GraphSage | Unsupervised GraphSAGE |
+| LINE | Representation learning based on first-order and second-order neighbors |
+| DeepWalk | Representation learning by DFS random walk |
+| MetaPath2Vec | Representation learning based on metapath |
+| Node2Vec | Representation learning combining DFS and BFS |
+| Struct2Vec | Representation learning based on structural similarity |
+| SGC | Simplified graph convolution neural network |
+| GES | Graph representation learning with node features |
+| DGI | Unsupervised representation learning based on graph convolution network |
+| GATNE | Representation learning of heterogeneous graphs based on message passing |
+The above models cover three areas: graph representation learning, graph neural networks, and heterogeneous graph learning, and the heterogeneous graph part itself includes both representation learning and graph neural network models.

## System requirements

@@ -72,7 +115,7 @@ PGL supports both Python 2 & 3

## Installation

-The current version of PGL is 1.0.0. You can simply install it via pip.
+You can simply install it via pip.

```sh
pip install pgl
diff --git a/README.zh.md b/README.zh.md index 5af657c1e05f7ce5f52cfffe5edf1355cf96c163..7b434dacf11153e2326a962c8a9c22ace3a43e39 100644
--- a/README.zh.md
+++ b/README.zh.md
@@ -1,4 +1,4 @@
-# Paddle Graph Learning (PGL)
+The logo of Paddle Graph Learning (PGL)

[文档](https://pgl.readthedocs.io/en/latest/) | [快速开始](https://pgl.readthedocs.io/en/latest/instruction.html) | [English](./README.md)

@@ -6,10 +6,10 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd

The Framework of Paddle Graph Learning (PGL)

-我们提供一系列的Python接口用于存储/读取/查询图数据结构,并且提供基于游走(Walk Based)以及消息传递(Message Passing)两种计算范式的计算接口(见上图)。利用这些接口,我们就能够轻松的搭建最前沿的图学习算法。结合PaddlePaddle深度学习框架,我们的框架基本能够覆盖大部分的图网络应用,包括图表示学习以及图神经网络。
+在最新发布的PGL中引入了异构图的支持,新增MetaPath采样支持异构图表示学习,新增异构图Message Passing机制支持基于消息传递的异构图算法,利用新增的异构图接口,能轻松搭建前沿的异构图学习算法。而且,在最新发布的PGL中,同时也增加了分布式图存储以及一些分布式图学习训练算法,例如,分布式deep walk和分布式graphsage。结合PaddlePaddle深度学习框架,我们的框架基本能够覆盖大部分的图网络应用,包括图表示学习以及图神经网络。

-## 特色:高效灵活的消息传递范式
+## 特色:高效性——支持Scatter-Gather及LodTensor消息传递

对比于一般的模型,图神经网络模型最大的优势在于它利用了节点与节点之间连接的信息。但是,如何通过代码来实现建模这些节点连接十分的麻烦。PGL采用与[DGL](https://github.com/dmlc/dgl)相似的**消息传递范式**用于作为构建图神经网络的接口。用户只需要简单的编写```send```还有```recv```函数就能够轻松的实现一个简单的GCN网络。如下图所示,首先,send函数被定义在节点之间的边上,用户自定义send函数![](http://latex.codecogs.com/gif.latex?\\phi^e})会把消息从源点发送到目标节点。然后,recv函数![](http://latex.codecogs.com/gif.latex?\\phi^v})负责将这些消息用汇聚函数 ![](http://latex.codecogs.com/gif.latex?\\oplus}) 汇聚起来。

@@ -31,9 +31,7 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd

尽管DGL用了一些内核融合(kernel fusion)的方法来将常用的sum,max等聚合函数用scatter-gather进行优化。但是对于**复杂的用户定义函数**,他们使用的Degree Bucketing算法,仅仅使用串行的方案来处理不同的分块,并不能充分利用GPU进行加速。然而,在PGL中我们使用基于LodTensor的消息传递能够充分地利用GPU的并行优化,在复杂的用户定义函数下,PGL的速度在我们的实验中甚至能够达到DGL的13倍。即使不使用scatter-gather的优化,PGL仍然有高效的性能表现。当然,我们也提供了scatter优化的聚合函数。

-## 性能测试
-
-
+### 性能测试
我们用Tesla V100-SXM2-16G测试了下列所有的GNN算法,每一个算法跑了200个Epoch来计算平均速度。准确率是在测试集上计算出来的,并且我们没有使用Early-stopping策略。

| 数据集 | 模型 | PGL准确率 | PGL速度 (epoch) | DGL 0.3.0 速度 (epoch) |
@@ -54,6 +52,52 @@ Paddle Graph Learning (PGL)是一个基于[PaddlePaddle](https://github.com/Padd
| Citeseer | **0.0150s** | 0.1278s | 8.52x |

+## 特色:易用性——原生支持异构图
+
+图可以很方便的表示真实世界中事物之间的联系,但是事物的类别以及事物之间的联系多种多样,因此,在异构图中,我们需要对图网络中的节点类型以及边类型进行区分。PGL针对异构图包含多种节点类型和多种边类型的特点进行建模,可以描述不同类型之间的复杂联系。
+
+### 支持异构图MetaPath walk采样
+The metapath sampling in heterogeneous graph
+上图左边描述的是一个购物的社交网络,上面的节点有用户和商品两大类,关系有用户和用户之间的关系,用户和商品之间的关系以及商品和商品之间的关系。上图的右边是一个简单的MetaPath采样过程,输入metapath为UPU(user-product-user),采出结果为
+The metapath result
+然后在此基础上引入word2vec等方法,支持异构图表示学习metapath2vec等算法。
+
+### 支持异构图Message Passing机制
+
+The message passing mechanism on heterogeneous graph
+在异构图上由于节点类型不同,消息传递的方式也有所不同。如上图左边,它有五个邻居节点,属于两种不同的节点类型。如上图右边,在消息传递的时候需要把属于不同类型的节点分开聚合,然后再合并成最终的消息,从而更新目标节点。在此基础上PGL支持基于消息传递的异构图算法,如GATNE等算法。
+
+
+## 特色:规模性——支持分布式图存储以及分布式学习算法
+
+在大规模的图网络学习中,通常需要多机图存储以及多机分布式训练。如下图所示,PGL提供一套大规模训练的解决方案,我们利用[PaddleFleet](https://github.com/PaddlePaddle/Fleet)(支持大规模分布式Embedding学习)作为我们参数服务器模块以及一套简易的分布式存储方案,可以轻松在MPI集群上搭建分布式大规模图学习方法。
+
+The distributed frame of PGL
+
+
+## 特色:丰富性——覆盖业界大部分图学习网络
+
+下列是框架中已经自带实现的十三种图网络学习模型
+
+| 模型 | 特点 |
+|---|---|
+| GCN | 图卷积网络 |
+| GAT | 基于Attention的图卷积网络 |
+| GraphSage | 基于邻居采样的大规模图卷积网络 |
+| unSup-GraphSage | 无监督学习的GraphSAGE |
+| LINE | 基于一阶、二阶邻居的表示学习 |
+| DeepWalk | DFS随机游走的表示学习 |
+| MetaPath2Vec | 基于metapath的表示学习 |
+| Node2Vec | 结合DFS及BFS的表示学习 |
+| Struct2Vec | 基于结构相似的表示学习 |
+| SGC | 简化的图卷积网络 |
+| GES | 加入节点特征的图表示学习方法 |
+| DGI | 基于图卷积网络的无监督表示学习 |
+| GATNE | 基于MessagePassing的异构图表示学习 |
+
+上述模型包含图表示学习,图神经网络以及异构图三部分,而异构图里面也分图表示学习和图神经网络。
+
+
+## 依赖

PGL依赖于:

@@ -67,7 +111,7 @@ PGL支持Python 2和3。

## 安装

-当前,PGL的版本是1.0.0。你可以简单的用pip进行安装。
+你可以简单的用pip进行安装。

```sh
pip install pgl
diff --git a/docs/requirements.txt b/docs/requirements.txt index 8178df62d1c796588ce38714f1978efeea0b9875..4e7960b2517dee1e49324ec1245b1a49595126bc 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -3,6 +3,5 @@ mistune
sphinx_rtd_theme
numpy >= 1.16.4
cython >= 0.25.2
-networkx
paddlepaddle
pgl
diff --git a/docs/source/_static/distributed_frame.png b/docs/source/_static/distributed_frame.png new file mode 100644 index 0000000000000000000000000000000000000000..5f8728caa19b6b7594bcc4d655391d4e7d00dfab
Binary files /dev/null and b/docs/source/_static/distributed_frame.png differ
diff --git a/docs/source/_static/framework_of_pgl.png b/docs/source/_static/framework_of_pgl.png index e65f3cd57fc13a7ae93a663988b7e7dacb3f0a65..f8f6293a6821f7a13901f5b83d6d904a4f64ee0d 100644
Binary files a/docs/source/_static/framework_of_pgl.png and b/docs/source/_static/framework_of_pgl.png differ
diff --git a/docs/source/_static/him_message_passing.png b/docs/source/_static/him_message_passing.png new file mode 100644 index 0000000000000000000000000000000000000000..52b0286ded75deaa2eac47fd25478099c73008ce
Binary files /dev/null and b/docs/source/_static/him_message_passing.png differ
diff --git a/docs/source/_static/logo.png b/docs/source/_static/logo.png new file mode 100644 index 0000000000000000000000000000000000000000..031bb27ce03480fab8e44a90bda996b63e43eb05
Binary files /dev/null and b/docs/source/_static/logo.png differ
diff --git a/docs/source/_static/message_passing_paradigm.png b/docs/source/_static/message_passing_paradigm.png index c4501c33d2bb116ad18019ff37704f55dae8fe6b..64afa0e9e425024025827bee8360ee1c337792cf 100644
Binary files a/docs/source/_static/message_passing_paradigm.png and b/docs/source/_static/message_passing_paradigm.png differ
diff --git a/docs/source/_static/metapath_result.png b/docs/source/_static/metapath_result.png new file mode 100644 index 0000000000000000000000000000000000000000..1f2a323e40cc7c6ecb1d25cc7243be850ffd389f
Binary files
/dev/null and b/docs/source/_static/metapath_result.png differ
diff --git a/docs/source/_static/metapath_sampling.png b/docs/source/_static/metapath_sampling.png new file mode 100644 index 0000000000000000000000000000000000000000..c5ea7b1d50592f5669e827a0f47a9ab5ef0ef856
Binary files /dev/null and b/docs/source/_static/metapath_sampling.png differ
diff --git a/docs/source/_static/parallel_degree_bucketing.png b/docs/source/_static/parallel_degree_bucketing.png index 695e097be3e22f2f6cf37ff502d61755eb6597be..fb9d5a4f9ca5ddcd5d48564934aad640e53e3482 100644
Binary files a/docs/source/_static/parallel_degree_bucketing.png and b/docs/source/_static/parallel_degree_bucketing.png differ
diff --git a/docs/source/api/pgl.contrib.heter_graph.rst b/docs/source/api/pgl.contrib.heter_graph.rst new file mode 100644 index 0000000000000000000000000000000000000000..e1cc7ef4adeb943f352fad96781ef572606a6b96
--- /dev/null
+++ b/docs/source/api/pgl.contrib.heter_graph.rst
@@ -0,0 +1,7 @@
+pgl.contrib.heter\_graph module: Heterogeneous Graph Storage
+============================================================
+
+.. automodule:: pgl.contrib.heter_graph
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/docs/source/api/pgl.contrib.heter_graph_wrapper.rst b/docs/source/api/pgl.contrib.heter_graph_wrapper.rst new file mode 100644 index 0000000000000000000000000000000000000000..74d9fe853df4639ee86470e67aa06c85c167cffe
--- /dev/null
+++ b/docs/source/api/pgl.contrib.heter_graph_wrapper.rst
@@ -0,0 +1,7 @@
+pgl.contrib.heter\_graph\_wrapper module: Heterogeneous Graph data holders for Paddle GNN.
+==========================================================================================
+
+.. automodule:: pgl.contrib.heter_graph_wrapper
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/docs/source/api/pgl.rst b/docs/source/api/pgl.rst index 36da0df57a7e8174f813dd4a4c869fff962d8462..cc581c85b0298d3454bcbab0779427e2b25340d9 100644
--- a/docs/source/api/pgl.rst
+++ b/docs/source/api/pgl.rst
@@ -9,3 +9,5 @@ API Reference
    pgl.data_loader
    pgl.utils.paddle_helper
    pgl.utils.mp_reader
+    pgl.contrib.heter_graph
+    pgl.contrib.heter_graph_wrapper
diff --git a/docs/source/conf.py b/docs/source/conf.py index ca7cb8eab16bc3817bcf23df78cfa2c9b68382f0..97098499b6e5577ea0b700a1714438520c8eaf7e 100644
--- a/docs/source/conf.py
+++ b/docs/source/conf.py
@@ -40,7 +40,7 @@ copyright = '2019, PaddlePaddle'
author = 'PaddlePaddle'

# The full version, including alpha/beta/rc tags
-release = '1.0.0'
+release = '1.0.1'

# -- General configuration ---------------------------------------------------
@@ -73,13 +73,12 @@ lanaguage = "zh_cn"
html_theme = "sphinx_rtd_theme"
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
html_show_sourcelink = False
-#html_logo = 'pgl_logo.png'
+html_logo = '_static/logo.png'

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
-'''
html_theme_options = {
    'canonical_url': '',
    'analytics_id': 'UA-XXXXXXX-1',  # Provided by Google in your dashboard
@@ -96,4 +95,3 @@ html_theme_options = {
    'includehidden': True,
    'titles_only': False
}
-'''
diff --git a/docs/source/examples/dgi_examples.rst b/docs/source/examples/dgi_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..dc5432dc73c6724b251b32464811bc3db02438c3
--- /dev/null
+++ b/docs/source/examples/dgi_examples.rst
@@ -0,0 +1 @@
+..
mdinclude:: ../../../examples/dgi/README.md diff --git a/docs/source/examples/distribute_deepwalk_examples.rst b/docs/source/examples/distribute_deepwalk_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..31350e6db1bb4614fa6f5643badead33b3d0a318 --- /dev/null +++ b/docs/source/examples/distribute_deepwalk_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/distribute_deepwalk/README.md diff --git a/docs/source/examples/distribute_graphsage_examples.rst b/docs/source/examples/distribute_graphsage_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..4c1fa78559348b8824cc5a8cafbc83c689386c6d --- /dev/null +++ b/docs/source/examples/distribute_graphsage_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/distribute_graphsage/README.md diff --git a/docs/source/examples/gat_examples.rst b/docs/source/examples/gat_examples.rst index 2399924cfc8f1ac515843e79ecf8f6826538755c..a94a58bb0c77cce65d7b73784319be2ba30118d7 100644 --- a/docs/source/examples/gat_examples.rst +++ b/docs/source/examples/gat_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/gat_examples.md +.. mdinclude:: ../../../examples/gat/README.md diff --git a/docs/source/examples/gat_examples_code.rst b/docs/source/examples/gat_examples_code.rst deleted file mode 100644 index ac00f2405cf3896160ac1a4162a5d0f7676aef40..0000000000000000000000000000000000000000 --- a/docs/source/examples/gat_examples_code.rst +++ /dev/null @@ -1,8 +0,0 @@ -View the Code -============= - -examples/gat/train.py - -.. literalinclude:: ../../../examples/gat/train.py - :language: python - :linenos: diff --git a/docs/source/examples/gatne_examples.rst b/docs/source/examples/gatne_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..e638f96fa8f94160959f8e9a09cff9303bee448b --- /dev/null +++ b/docs/source/examples/gatne_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/GATNE/README.md diff --git a/docs/source/examples/gcn_examples.rst b/docs/source/examples/gcn_examples.rst index b7404df2f741e2c4ea987a1b187056495923993c..6b1089e5076313ee46e878d52c04093f5ac53e62 100644 --- a/docs/source/examples/gcn_examples.rst +++ b/docs/source/examples/gcn_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/gcn_examples.md +.. mdinclude:: ../../../examples/gcn/README.md diff --git a/docs/source/examples/gcn_examples_code.rst b/docs/source/examples/gcn_examples_code.rst deleted file mode 100644 index c83c42f42f326a709f1a0e37695f9187899140fc..0000000000000000000000000000000000000000 --- a/docs/source/examples/gcn_examples_code.rst +++ /dev/null @@ -1,8 +0,0 @@ -View the Code -============= - -examples/gcn/train.py - -.. literalinclude:: ../../../examples/gcn/train.py - :language: python - :linenos: diff --git a/docs/source/examples/ges_examples.rst b/docs/source/examples/ges_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..a115fb34c827d77c156d9cbbf18a108c5a1cc01a --- /dev/null +++ b/docs/source/examples/ges_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/ges/README.md diff --git a/docs/source/examples/graphsage_examples.rst b/docs/source/examples/graphsage_examples.rst index ef6c4ed709958444f70252c8c2c12f307ecb3d20..dcb3291fbe6a4b2f57f32329ab1a25c4686e3b7f 100644 --- a/docs/source/examples/graphsage_examples.rst +++ b/docs/source/examples/graphsage_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/graphsage_examples.md +.. 
mdinclude:: ../../../examples/graphsage/README.md diff --git a/docs/source/examples/graphsage_examples_code.rst b/docs/source/examples/graphsage_examples_code.rst deleted file mode 100644 index 9efd5650891703388595d5f74f6ebf455300f2f3..0000000000000000000000000000000000000000 --- a/docs/source/examples/graphsage_examples_code.rst +++ /dev/null @@ -1,20 +0,0 @@ -View the Code -============= - -examples/graphsage/train.py - -.. literalinclude:: ../../../examples/graphsage/train.py - :language: python - :linenos: - -examples/graphsage/reader.py - -.. literalinclude:: ../../../examples/graphsage/reader.py - :language: python - :linenos: - -examples/graphsage/model.py - -.. literalinclude:: ../../../examples/graphsage/model.py - :language: python - :linenos: diff --git a/docs/source/examples/line_examples.rst b/docs/source/examples/line_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..3e8f5feec8f3ce7d4a0d38fb7dd1d3df423aa2fe --- /dev/null +++ b/docs/source/examples/line_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/line/README.md diff --git a/docs/source/examples/md/gat_examples.md b/docs/source/examples/md/gat_examples.md deleted file mode 100644 index fd55c0469afc8e68f515377b9265b7569b35acf3..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/gat_examples.md +++ /dev/null @@ -1,63 +0,0 @@ -# Building Graph Attention Networks - -[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architectures that operate on graph-structured data, which leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce GAT algorithms and reach the same level of indicators as the paper in citation network benchmarks. -### Simple example to build single head GAT - -To build a gat layer, one can use our pre-defined ```pgl.layers.gat``` or just write a gat layer with message passing interface. -```python -import paddle.fluid as fluid -def gat_layer(graph_wrapper, node_feature, hidden_size): - def send_func(src_feat, dst_feat, edge_feat): - logits = src_feat["a1"] + dst_feat["a2"] - logits = fluid.layers.leaky_relu(logits, alpha=0.2) - return {"logits": logits, "h": src_feat } - - def recv_func(msg): - norm = fluid.layers.sequence_softmax(msg["logits"]) - output = msg["h"] * norm - return output - - h = fluid.layers.fc(node_feature, hidden_size, bias_attr=False, name="hidden") - a1 = fluid.layers.fc(node_feature, 1, name="a1_weight") - a2 = fluid.layers.fc(node_feature, 1, name="a2_weight") - message = graph_wrapper.send(send_func, - nfeat_list=[("h", h), ("a1", a1), ("a2", a2)]) - output = graph_wrapper.recv(recv_func, message) - return output -``` - -### Datasets - -The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907). - -### Dependencies - -- paddlepaddle>=1.6 -- pgl - -### Performance - -We train our models for 200 epochs and report the accuracy on the test dataset. - -| Dataset | Accuracy | -| --- | --- | -| Cora | ~83% | -| Pubmed | ~78% | -| Citeseer | ~70% | - -### How to run - -For examples, use gpu to train gat on cora dataset. -``` -python train.py --dataset cora --use_cuda -``` - -#### Hyperparameters - -- dataset: The citation dataset "cora", "citeseer", "pubmed". -- use_cuda: Use gpu if assign use_cuda. 
- - -### View the Code - -See the code [here](gat_examples_code.html) diff --git a/docs/source/examples/md/gcn_examples.md b/docs/source/examples/md/gcn_examples.md deleted file mode 100644 index 11d5f4ee88aeb46575da8ee7b26b24bfb837b85a..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/gcn_examples.md +++ /dev/null @@ -1,59 +0,0 @@ -# Building Graph Convolutional Network - - -[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks. - -### Simple example to build GCN - -To build a gcn layer, one can use our pre-defined ```pgl.layers.gcn``` or just write a gcn layer with message passing interface. -```python -import paddle.fluid as fluid -def gcn_layer(graph_wrapper, node_feature, hidden_size, act): - def send_func(src_feat, dst_feat, edge_feat): - return src_feat["h"] - - def recv_func(msg): - return fluid.layers.sequence_pool(msg, "sum") - - message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)]) - output = graph_wrapper.recv(recv_func, message) - output = fluid.layers.fc(output, size=hidden_size, act=act) - return output -``` - -### Datasets - -The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907). - -### Dependencies - -- paddlepaddle>=1.6 -- pgl - -### Performance - -We train our models for 200 epochs and report the accuracy on the test dataset. - -| Dataset | Accuracy | -| --- | --- | -| Cora | ~81% | -| Pubmed | ~79% | -| Citeseer | ~71% | - - -### How to run - -For examples, use gpu to train gcn on cora dataset. -``` -python train.py --dataset cora --use_cuda -``` - -#### Hyperparameters - -- dataset: The citation dataset "cora", "citeseer", "pubmed". -- use_cuda: Use gpu if assign use_cuda. - - -### View the Code - -See the code [here](gcn_examples_code.html) diff --git a/docs/source/examples/md/graphsage_examples.md b/docs/source/examples/md/graphsage_examples.md deleted file mode 100644 index 6ad6901fd464752fb47a81a07604e706d59ee56b..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/graphsage_examples.md +++ /dev/null @@ -1,62 +0,0 @@ -# GraphSage for Large Scale Networks - -[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature -information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce GraphSAGE algorithm and reach the same level of indicators as the paper in Reddit Dataset. Besides, this is an example of subgraph sampling and training in PGL. - -### Datasets -The reddit dataset should be downloaded from the following links and placed in directory ```./data```. The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf). 
- -- reddit.npz https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J -- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt - - -### Dependencies - -- paddlepaddle>=1.6 -- pgl - -### How to run - -To train a GraphSAGE model on Reddit Dataset, you can just run -``` - python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry -``` - -If you want to train a GraphSAGE model with multiple GPUs, you can just run - -``` -CUDA_VISIBLE_DEVICES=0,1 python train_multi.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --normalize --symmetry --num_trainer 2 -``` - - - -#### Hyperparameters - -- epoch: Number of epochs default (10) -- use_cuda: Use gpu if assign use_cuda. -- graphsage_type: We support 4 aggregator types including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm". -- normalize: Normalize the input feature if assign normalize. -- sample_workers: The number of workers for multiprocessing subgraph sample. -- lr: Learning rate. -- symmetry: Make the edges symmetric if assign symmetry. -- batch_size: Batch size. -- samples_1: The max neighbors for the first hop neighbor sampling. (default: 25) -- samples_2: The max neighbors for the second hop neighbor sampling. (default: 10) -- hidden_size: The hidden size of the GraphSAGE models. - - -### Performance - -We train our models for 200 epochs and report the accuracy on the test dataset. - - -| Aggregator | Accuracy | Reported in paper | -| --- | --- | --- | -| Mean | 95.70% | 95.0% | -| Meanpool | 95.60% | 94.8% | -| Maxpool | 94.95% | 94.8% | -| LSTM | 95.13% | 95.4% | - -### View the Code - -See the code [here](graphsage_examples_code.html). diff --git a/docs/source/examples/md/node2vec_examples.md b/docs/source/examples/md/node2vec_examples.md deleted file mode 100644 index 981c39f3f0c3e1cdd33137deeb1e40b935773b2d..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/node2vec_examples.md +++ /dev/null @@ -1,41 +0,0 @@ -# Graph Representation Learning: Node2vec - - -[Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce node2vec algorithms and reach the same level of indicators as the paper. -## Datasets -The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html). -## Dependencies -- paddlepaddle>=1.6 -- pgl - -## How to run - -For examples, use gpu to train gcn on cora dataset. -```sh -# multiclass task example -python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_BlogCatalog/ --offline_learning --epoch 400 - -python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000 - -# link prediction task example -python node2vec.py --use_cuda --dataset ArXiv --save_path -./tmp/node2vec_ArXiv --offline_learning --epoch 10 - -python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400 -``` -## Hyperparameters -- dataset: The citation dataset "BlogCatalog" and "ArXiv". -- use_cuda: Use gpu if assign use_cuda. 
- -### Experiment results -Dataset|model|Task|Metric|PGL Result|Reported Result ---|--|--|--|--|-- -BlogCatalog|deepwalk|multi-label classification|MacroF1|0.250|0.211 -BlogCatalog|node2vec|multi-label classification|MacroF1|0.262|0.258 -ArXiv|deepwalk|link prediction|AUC|0.9538|0.9340 -ArXiv|node2vec|link prediction|AUC|0.9541|0.9366 - - -## View the Code - -See the code [here](node2vec_examples_code.html) diff --git a/docs/source/examples/md/static_gat_examples.md b/docs/source/examples/md/static_gat_examples.md deleted file mode 100644 index 7659be190179e88c4739181f7e6b0dbaa9899900..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/static_gat_examples.md +++ /dev/null @@ -1,42 +0,0 @@ -# StaticGraphWrapper for GAT Speed Optimization - -[Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architectures that operate on graph-structured data, which leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce GAT algorithms and reach the same level of indicators as the paper in citation network benchmarks. - -However, different from the reproduction in **examples/gat**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into gpu or cpu memories which achieves better performance on speed. - - -### Datasets - -The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907). - -### Dependencies - -- paddlepaddle>=1.4 (The speed can be faster in 1.5.) -- pgl - -### Performance - -We train our models for 200 epochs and report the accuracy on the test dataset. - - -| Dataset | Accuracy | epoch time | examples/gat | Improvement | -| --- | --- | --- | --- | --- | -| Cora | ~83% | 0.0119s | 0.0175s | 1.47x | -| Pubmed | ~78% | 0.0193s |0.0295s | 1.53x | -| Citeseer | ~70% | 0.0124s |0.0253s | 2.04x | - -### How to run - -For examples, use gpu to train gat on cora dataset. -```sh -python train.py --dataset cora --use_cuda -``` - -#### Hyperparameters - -- dataset: The citation dataset "cora", "citeseer", "pubmed". -- use_cuda: Use gpu if assign use_cuda. - -### View the Code - -See the code [here](static_gat_examples_code.html) diff --git a/docs/source/examples/md/static_gcn_examples.md b/docs/source/examples/md/static_gcn_examples.md deleted file mode 100644 index 6b0997dcf587ad398e41c37b720275d1edbeb236..0000000000000000000000000000000000000000 --- a/docs/source/examples/md/static_gcn_examples.md +++ /dev/null @@ -1,44 +0,0 @@ -# StaticGraphWrapper for GCN Speed Optimization - -[Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks. - -However, different from the reproduction in **examples/gcn**, we use `pgl.graph_wrapper.StaticGraphWrapper` to preload the graph data into gpu or cpu memories which achieves better performance on speed. - -### Datasets - -The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907). - -### Dependencies - -- paddlepaddle>=1.6 -- pgl - -### Performance - -We train our models for 200 epochs and report the accuracy on the test dataset. 
- - -| Dataset | Accuracy | epoch time | examples/gcn | Improvement | -| --- | --- | --- |---| --- | --- | -| Cora | ~81% | 0.0047s | 0.0104s | 2.21x | -| Pubmed | ~79% | 0.0049s |0.0154s | 3.14x | -| Citeseer | ~71% | 0.0045s |0.0177s | 3.93x | - - - -### How to run - -For examples, use gpu to train gcn on cora dataset. -```sh -python train.py --dataset cora --use_cuda -``` - -#### Hyperparameters - -- dataset: The citation dataset "cora", "citeseer", "pubmed". -- use_cuda: Use gpu if assign use_cuda. - - -### View the Code - -See the code [here](static_gcn_examples_code.html) diff --git a/docs/source/examples/metapath2vec_examples.rst b/docs/source/examples/metapath2vec_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..e28d0e27c5c34ed609b9364f4dc91b00d36897ad --- /dev/null +++ b/docs/source/examples/metapath2vec_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/metapath2vec/README.md diff --git a/docs/source/examples/node2vec_examples.rst b/docs/source/examples/node2vec_examples.rst index a70fdff86103501df0c2fa1a1d8dd11e59ac711b..2a835baf3a818bfba791e27c9d301c36e18d2118 100644 --- a/docs/source/examples/node2vec_examples.rst +++ b/docs/source/examples/node2vec_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/node2vec_examples.md +.. mdinclude:: ../../../examples/node2vec/README.md diff --git a/docs/source/examples/node2vec_examples_code.rst b/docs/source/examples/node2vec_examples_code.rst deleted file mode 100644 index 4483857ab1db7c05c9477bd8af2377d70295c8e6..0000000000000000000000000000000000000000 --- a/docs/source/examples/node2vec_examples_code.rst +++ /dev/null @@ -1,20 +0,0 @@ -View the Code -============= - -examples/node2vec/node2vec.py - -.. literalinclude:: ../../../examples/node2vec/node2vec.py - :language: python - :linenos: - -examples/node2vec/multi_class.py - -.. literalinclude:: ../../../examples/node2vec/multi_class.py - :language: python - :linenos: - -examples/node2vec/link_predict.py - -.. literalinclude:: ../../../examples/node2vec/link_predict.py - :language: python - :linenos: diff --git a/docs/source/examples/sgc_examples.rst b/docs/source/examples/sgc_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..809c5487513e9492880e80c0eb2507e9f37e12bc --- /dev/null +++ b/docs/source/examples/sgc_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/sgc/README.md diff --git a/docs/source/examples/static_gat_examples.rst b/docs/source/examples/static_gat_examples.rst index 5b2a8eddedb41c24f930763fcfc7be6ce08afb71..fab497014bfdc7fd456be6c2e029b79f33f4fa9e 100644 --- a/docs/source/examples/static_gat_examples.rst +++ b/docs/source/examples/static_gat_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/static_gat_examples.md +.. mdinclude:: ../../../examples/static_gat/README.md diff --git a/docs/source/examples/static_gat_examples_code.rst b/docs/source/examples/static_gat_examples_code.rst deleted file mode 100644 index 7415c5a9f414ce825f61ced5782f0e9c960e50e7..0000000000000000000000000000000000000000 --- a/docs/source/examples/static_gat_examples_code.rst +++ /dev/null @@ -1,8 +0,0 @@ -View the Code -============= - -examples/static_gat/train.py - -.. 
literalinclude:: ../../../examples/static_gat/train.py - :language: python - :linenos: diff --git a/docs/source/examples/static_gcn_examples.rst b/docs/source/examples/static_gcn_examples.rst index af896b9b7d35defc6433d0ddf2e78edc750cafb4..d5003020bc41d9706a8d41a94035c481a28af6f2 100644 --- a/docs/source/examples/static_gcn_examples.rst +++ b/docs/source/examples/static_gcn_examples.rst @@ -1 +1 @@ -.. mdinclude:: md/static_gcn_examples.md +.. mdinclude:: ../../../examples/static_gcn/README.md diff --git a/docs/source/examples/static_gcn_examples_code.rst b/docs/source/examples/static_gcn_examples_code.rst deleted file mode 100644 index 6cdedd37b3d65daf4dd07ab76ce5031e86ad7a02..0000000000000000000000000000000000000000 --- a/docs/source/examples/static_gcn_examples_code.rst +++ /dev/null @@ -1,8 +0,0 @@ -View the Code -============= - -examples/static_gcn/train.py - -.. literalinclude:: ../../../examples/static_gcn/train.py - :language: python - :linenos: diff --git a/docs/source/examples/strucvec_examples.rst b/docs/source/examples/strucvec_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..9725e5f873eb9ddedf8bb5b12ca26709185dfaab --- /dev/null +++ b/docs/source/examples/strucvec_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/strucvec/README.md diff --git a/docs/source/examples/unsup_graphsage_examples.rst b/docs/source/examples/unsup_graphsage_examples.rst new file mode 100644 index 0000000000000000000000000000000000000000..0280cc8fce69252a33eb6369cb0cc56e84a58320 --- /dev/null +++ b/docs/source/examples/unsup_graphsage_examples.rst @@ -0,0 +1 @@ +.. mdinclude:: ../../../examples/unsup_graphsage/README.md diff --git a/docs/source/index.rst b/docs/source/index.rst index e0565fae4f2ba2cc36d2394788ea267c10bb4b56..fb192ae88c56d3211927fe33ed027afd2527e2f8 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -15,14 +15,9 @@ Quick Start .. toctree:: :maxdepth: 1 :caption: Quick Start - :hidden: - - instruction.rst - -See instruction_ for quick start. - -.. _instruction: instruction.html + quick_start/instruction.rst + quick_start/introduction_for_hetergraph.rst .. toctree:: @@ -34,7 +29,16 @@ See instruction_ for quick start. examples/static_graph_wrapper.rst examples/node2vec_examples.rst examples/graphsage_examples.rst - + examples/dgi_examples.rst + examples/distribute_deepwalk_examples.rst + examples/distribute_graphsage_examples.rst + examples/ges_examples.rst + examples/line_examples.rst + examples/sgc_examples.rst + examples/strucvec_examples.rst + examples/gatne_examples.rst + examples/metapath2vec_examples.rst + examples/unsup_graphsage_examples.rst .. toctree:: :maxdepth: 2 diff --git a/docs/source/md/introduction.md b/docs/source/md/introduction.md index bd42565b410e84661cf18731645dbc131102085f..5dd4b9fc52be80ddac53ec89298b8333584b9479 100644 --- a/docs/source/md/introduction.md +++ b/docs/source/md/introduction.md @@ -1,7 +1,6 @@ -# Paddle Graph Learning (PGL) +# Paddle Graph Learning (PGL) - -Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle). +Paddle Graph Learning (PGL) is an efficient and flexible graph learning framework based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle).
@@ -9,17 +8,18 @@ Paddle Graph Learning (PGL) is an efficient and flexible graph learning framewor
The Framework of Paddle Graph Learning (PGL)
-We provide python interfaces for storing/reading/querying graph structured data and two fundamental computational interfaces, which are walk based paradigm and message-passing based paradigm as shown in the above framework of PGL, for building cutting-edge graph learning algorithms. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
+The newly released PGL supports heterogeneous graph learning on both the walk-based paradigm and the message-passing-based paradigm by providing MetaPath sampling and a Message Passing mechanism on heterogeneous graphs. Furthermore, the newly released PGL also supports distributed graph storage and several distributed training algorithms, such as distributed DeepWalk and distributed GraphSAGE. Combined with the PaddlePaddle deep learning framework, we are able to support both graph representation learning models and graph neural networks, and thus our framework has a wide range of graph-based applications.
+
+## Highlight: Efficiency - Support Scatter-Gather and LodTensor Message Passing

-## Highlight: Efficient and Flexible Message Passing Paradigm
+One of the most important benefits of graph neural networks compared to other models is the ability to use node-to-node connectivity information, but coding the communication between nodes is very cumbersome. At PGL we adopt a **Message Passing Paradigm** similar to DGL to help build a custom graph neural network easily. Users only need to write ``send`` and ``recv`` functions to easily implement a simple GCN. As shown in the following figure, for the first step the send function is defined on the edges of the graph, and the user can customize the send function $\phi^e$ to send the message from the source to the target node. For the second step, the recv function $\phi^v$ is responsible for aggregating $\oplus$ messages together from different sources.
The basic idea of message passing paradigm
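+For concreteness, here is a minimal GCN layer written against this send/recv interface, adapted from the repository's own GCN example (``hidden_size`` and ``act`` are the layer's parameters):
+
+```python
+import paddle.fluid as fluid
+
+def gcn_layer(graph_wrapper, node_feature, hidden_size, act):
+    def send_func(src_feat, dst_feat, edge_feat):
+        # the message sent along each edge is the source node feature
+        return src_feat["h"]
+
+    def recv_func(msg):
+        # sum the variable-length messages arriving at each node
+        return fluid.layers.sequence_pool(msg, "sum")
+
+    message = graph_wrapper.send(send_func, nfeat_list=[("h", node_feature)])
+    output = graph_wrapper.recv(recv_func, message)
+    output = fluid.layers.fc(output, size=hidden_size, act=act)
+    return output
+```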
- -As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then apply an aggregate function $\oplus$ on each batch serially. For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) taking the message as variable length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**. +As shown in the left of the following figure, to adapt general user-defined message aggregate functions, DGL uses the degree bucketing method to combine nodes with the same degree into a batch and then apply an aggregate function $\oplus$ on each batch serially. For our PGL UDF aggregate function, we organize the message as a [LodTensor](http://www.paddlepaddle.org/documentation/docs/en/1.4/user_guides/howto/basic_concept/lod_tensor_en.html) in [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) taking the message as variable length sequences. And we **utilize the features of LodTensor in Paddle to obtain fast parallel aggregation**.
@@ -28,21 +28,22 @@ As shown in the left of the following figure, to adapt general user-defined mess
-Users only need to call the ```sequence_ops``` functions provided by Paddle to easily implement efficient message aggregation. For examples, using ```sequence_pool``` to sum the neighbor message.
+Users only need to call the ``sequence_ops`` functions provided by Paddle to easily implement efficient message aggregation. For example, using ``sequence_pool`` to sum the neighbor messages.

```python
import paddle.fluid as fluid
def recv(msg):
    return fluid.layers.sequence_pool(msg, "sum")
```

-Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather. For **complex user-defined functions** with degree bucketing algorithm, the serial execution for each degree bucket cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based message is performed in parallel, which can fully utilize GPU parallel optimization. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide build-in scatter-optimized message aggregation functions.
+Although DGL does some kernel fusion optimization for general sum, max and other aggregate functions with scatter-gather, for **complex user-defined functions** the degree bucketing algorithm executes each degree bucket serially and cannot take full advantage of the performance improvement provided by GPU. However, operations on the PGL LodTensor-based messages are performed in parallel, which can fully utilize GPU parallel optimization. In our experiments, PGL can reach up to 13 times the speed of DGL with complex user-defined functions. Even without scatter-gather optimization, PGL still has excellent performance. Of course, we still provide built-in scatter-optimized message aggregation functions.

-## Performance
+### Performance

-We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on test dataset without early stoppping.
-| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL speed (epoch time) |
+We test all the following GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs to get average speeds. And we report the accuracy on the test dataset without early stopping.
+
+| Dataset | Model | PGL Accuracy | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) |
| -------- | ----- | ----------------- | ------------ | ------------------------------------ |
| Cora | GCN |81.75% | 0.0047s | **0.0045s** |
| Cora | GAT | 83.5% | **0.0119s** | 0.0141s |
@@ -50,3 +51,95 @@ We test all the GNN algorithms with Tesla V100-SXM2-16G running for 200 epochs t
| Pubmed | GAT | 77% |0.0193s|**0.0144s**|
| Citeseer | GCN |70.2%| **0.0045s** |0.0046s|
| Citeseer | GAT |68.8%| **0.0124s** |0.0139s|
+
+
+If we use complex user-defined aggregation like [GraphSAGE-LSTM](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) that aggregates neighbor features with an LSTM while ignoring the order of received messages, the optimized message-passing in DGL will be forced to degenerate into the degree bucketing scheme, and the speed will be much slower than the PGL implementation. Performance may vary with the scale of the graph; in our experiments, PGL can reach up to 13 times the speed of DGL.
+
+| Dataset | PGL speed (epoch time) | DGL 0.3.0 speed (epoch time) | Speed up|
+| -------- | ------------ | ------------------------------------ |----|
+| Cora | **0.0186s** | 0.1638s | 8.80x|
+| Pubmed | **0.0388s** |0.5275s | 13.59x|
+| Citeseer | **0.0150s** | 0.1278s | 8.52x |
+
+## Highlight: Flexibility - Natively Support Heterogeneous Graph Learning
+
+Graphs can conveniently represent the relations between things in the real world, but the categories of things and the relations between them are various. Therefore, in a heterogeneous graph, we need to distinguish the node types and edge types in the graph network. PGL models heterogeneous graphs that contain multiple node types and multiple edge types, and can describe the complex connections between different types.
+
+### Support meta path walk sampling on heterogeneous graph
+
+
+
The metapath sampling in heterogeneous graph
+
+The left side of the figure above describes a shopping social network. Its nodes fall into two categories, users and products, and its relations cover user-user, user-product, and product-product. The right side of the figure shows a simple MetaPath sampling process: given the metapath UPU (user-product-user), sampling yields walks like the results below (a toy sketch of the sampling loop follows the figure)
+
+
+
The metapath result
+
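+A toy version of the walk generator makes the mechanism concrete. This is a plain-Python sketch of the idea, not PGL's sampling API; the ``neighbors`` structure and the edge-type names are illustrative:
+
+```python
+import random
+
+def metapath_walk(neighbors, start_node, metapath, walk_length):
+    """neighbors: dict mapping edge type -> {node: [neighbor, ...]};
+    metapath: cyclic list of edge types, e.g. ["user2product", "product2user"]."""
+    walk = [start_node]
+    for step in range(walk_length - 1):
+        edge_type = metapath[step % len(metapath)]
+        candidates = neighbors[edge_type].get(walk[-1], [])
+        if not candidates:
+            break  # dead end: no neighbor of the required type
+        walk.append(random.choice(candidates))
+    return walk
+```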
+On this basis, methods such as word2vec can then be introduced to support heterogeneous graph representation learning algorithms such as metapath2vec.
+
+### Support Message Passing mechanism on heterogeneous graph
+
+
+
The message passing mechanism on heterogeneous graph
+
+Because nodes in a heterogeneous graph have different types, message passing between them also differs by type. As shown on the left of the figure above, the target node has five neighbors belonging to two different node types. As shown on the right, during message passing the neighbors of different types are first aggregated separately and then merged into the final message that updates the target node. On this basis, PGL supports heterogeneous graph algorithms based on message passing, such as GATNE.
+
+
+## Large-Scale: Support distributed graph storage and distributed training algorithms
+In most cases of large-scale graph learning, we need distributed graph storage and distributed training support. As shown in the following figure, PGL provides a general solution for large-scale training: we adopt [PaddleFleet](https://github.com/PaddlePaddle/Fleet) as our distributed parameter server, which supports large-scale distributed embeddings, together with a lightweight distributed storage engine, so that users can easily set up a large-scale distributed training algorithm on MPI clusters.
+
+
+
The distributed frame of PGL
+
+
+
+## Highlight: Tons of Models
+The following are 13 graph learning models that have been implemented in the framework.
+
+|Model | feature |
+|---|---|
+| [GCN](examples/gcn_examples.html)| Graph Convolutional Neural Networks |
+| [GAT](examples/gat_examples.html)| Graph Attention Network |
+| [GraphSage](examples/graphsage_examples.html)|Large-scale graph convolution network based on neighborhood sampling|
+| [unSup-GraphSage](examples/unsup_graphsage_examples.html) | Unsupervised GraphSAGE |
+| [LINE](examples/line_examples.html)| Representation learning based on first-order and second-order neighbors |
+| [DeepWalk](examples/distribute_deepwalk_examples.html)| Representation learning by DFS random walk |
+| [MetaPath2Vec](examples/metapath2vec_examples.html)| Representation learning based on metapath |
+| [Node2Vec](examples/node2vec_examples.html)| Representation learning combining DFS and BFS |
+| [Struct2Vec](examples/strucvec_examples.html)| Representation learning based on structural similarity |
+| [SGC](examples/sgc_examples.html)| Simplified graph convolution neural network |
+| [GES](examples/ges_examples.html)| Graph representation learning with node features |
+| [DGI](examples/dgi_examples.html)| Unsupervised representation learning based on graph convolution network |
+| [GATNE](examples/gatne_examples.html)| Representation learning of heterogeneous graphs based on message passing |
+
+The above models cover three areas: graph representation learning, graph neural networks, and heterogeneous graph learning, and the heterogeneous graph part itself includes both representation learning and graph neural network models.
+
+## System requirements
+
+PGL requires:
+
+* paddle >= 1.6
+* cython
+
+
+PGL supports both Python 2 & 3
+
+
+## Installation
+
+You can simply install it via pip.
+
+```sh
+pip install pgl
+```
+
+## The Team
+
+PGL is developed and maintained by NLP and Paddle Teams at Baidu.
+
+## License
+
+PGL uses Apache License 2.0.
diff --git a/docs/source/quick_start/images/heter_graph_introduction.png b/docs/source/quick_start/images/heter_graph_introduction.png new file mode 100644 index 0000000000000000000000000000000000000000..ca7f6c3d6db1fb8f9e587683c6c100fbb21b6a79
Binary files /dev/null and b/docs/source/quick_start/images/heter_graph_introduction.png differ
diff --git a/docs/source/quick_start/images/quick_start_graph.png b/docs/source/quick_start/images/quick_start_graph.png new file mode 100644 index 0000000000000000000000000000000000000000..5c220f3cdf9e03ff809aeeced7c31eb4941ea34a
Binary files /dev/null and b/docs/source/quick_start/images/quick_start_graph.png differ
diff --git a/docs/source/instruction.rst b/docs/source/quick_start/instruction.rst similarity index 100% rename from docs/source/instruction.rst rename to docs/source/quick_start/instruction.rst
diff --git a/docs/source/quick_start/introduction_for_hetergraph.rst b/docs/source/quick_start/introduction_for_hetergraph.rst new file mode 100644 index 0000000000000000000000000000000000000000..1de10f4e60a689301ef07986b3478507afed0acd
--- /dev/null
+++ b/docs/source/quick_start/introduction_for_hetergraph.rst
@@ -0,0 +1,22 @@
+Quick Start with Heterogeneous Graph
+====================================
+
+Install PGL
+-----------
+To install Paddle Graph Learning, we need the following packages.
+
+
+.. code-block:: sh
+
+    paddlepaddle >= 1.6
+    cython
+
+We can simply install pgl by pip.
+
+.. code-block:: sh
+
+    pip install pgl
+
+..
mdinclude:: md/quick_start_for_heterGraph.md
+
+
diff --git a/docs/source/md/quick_start.md b/docs/source/quick_start/md/quick_start.md similarity index 99% rename from docs/source/md/quick_start.md rename to docs/source/quick_start/md/quick_start.md index 0a30c39e822cb700b84f4df37df54d7de2e80b1a..fe5415b04d1119f9fbf7380bdb48e48153a9f68a 100644
--- a/docs/source/md/quick_start.md
+++ b/docs/source/quick_start/md/quick_start.md
@@ -1,6 +1,6 @@
## Step 1: using PGL to create a graph
Suppose we have a graph with 10 nodes and 14 edges as shown in the following figure:
-![A simple graph](_static/quick_start_graph.png)
+![A simple graph](images/quick_start_graph.png)

Our purpose is to train a graph neural network to classify yellow and green nodes. So we can create this graph in such way:
```python
diff --git a/docs/source/quick_start/md/quick_start_for_heterGraph.md b/docs/source/quick_start/md/quick_start_for_heterGraph.md new file mode 100644 index 0000000000000000000000000000000000000000..1be9cb228ce965cc8f0b6ae65ffc3738e08d0b2a
--- /dev/null
+++ b/docs/source/quick_start/md/quick_start_for_heterGraph.md
@@ -0,0 +1,168 @@
+## Introduction
+
+In the real world, many graphs contain multiple types of nodes and edges, which we call heterogeneous graphs. Obviously, heterogeneous graphs are more complex than homogeneous graphs.
+
+To deal with such graphs, PGL provides a graph framework that supports graph neural network computation and meta-path-based sampling on heterogeneous graphs.
+
+The goals of this tutorial:
+* show an example of heterogeneous graph data;
+* understand how PGL supports computation on heterogeneous graphs;
+* use PGL to implement a simple heterogeneous graph neural network model that classifies a particular type of node in a heterogeneous graph.
+
+## Example of a heterogeneous graph
+
+A lot of graph data consists of edges and nodes of multiple types. For example, an **e-commerce network** is a very common heterogeneous graph in the real world. It contains at least two types of nodes (user and item) and two types of edges (buy and click).
+
+The following figure depicts several users clicking or buying items. This graph has two types of nodes corresponding to "user" and "item". It also contains two types of edges, "buy" and "click".
+![A simple heterogeneous e-commerce graph](images/heter_graph_introduction.png)
+
+## Creating a heterogeneous graph with PGL
+
+In a heterogeneous graph there are multiple edge types, so we need to distinguish them. In PGL, edges are built in the following format:
+```python
+edges = {
+        'click': [(0, 4), (0, 7), (1, 6), (2, 5), (3, 6)],
+        'buy': [(0, 5), (1, 4), (1, 6), (2, 7), (3, 5)],
+    }
+```
+
+Nodes in a heterogeneous graph are also of different types. Therefore, you need to mark the type of each node; the format of the node types is as follows:
+
+```python
+node_types = [(0, 'user'), (1, 'user'), (2, 'user'), (3, 'user'), (4, 'item'),
+            (5, 'item'),(6, 'item'), (7, 'item')]
+```
+
+Because there are different types of edges, edge features also need to be separated by type.
+
+```python
+import numpy as np
+
+num_nodes = len(node_types)
+
+node_features = {'features': np.random.randn(num_nodes, 8).astype("float32")}
+
+edge_num_list = []
+for edge_type in edges:
+    edge_num_list.append(len(edges[edge_type]))
+
+# cast the edge features to float32 as well, since Paddle expects float32 inputs
+edge_features = {
+    'click': {'h': np.random.randn(edge_num_list[0], 4).astype("float32")},
+    'buy': {'h': np.random.randn(edge_num_list[1], 4).astype("float32")},
+}
+```
+
+Now, we can build a heterogeneous graph by using PGL.
+
+
+In PGL, we need to use `graph_wrapper` as a container for graph data, so here we create one `graph_wrapper` for each edge type.
+
+```python
+place = fluid.CPUPlace()
+
+# create a GraphWrapper as a container for graph data
+gw = heter_graph_wrapper.HeterGraphWrapper(name='heter_graph',
+                                           place=place,
+                                           edge_types=g.edge_types_info(),
+                                           node_feat=g.node_feat_info(),
+                                           edge_feat=g.edge_feat_info())
+```
+
+
+
+## Message passing
+
+After building the heterogeneous graph, we can easily perform message passing on it. In this case, we have two different types of edges, so we can write the message-passing function as follows:
+
+```python
+def message_passing(gw, edge_types, features, name=''):
+    def __message(src_feat, dst_feat, edge_feat):
+        """Send the source node features to the target nodes."""
+        return src_feat['h']
+    def __reduce(feat):
+        """Sum the received messages for every target node."""
+        return fluid.layers.sequence_pool(feat, pool_type='sum')
+
+    assert len(edge_types) == len(features)
+    output = []
+    # run send/recv separately on the subgraph of each edge type
+    for i in range(len(edge_types)):
+        msg = gw[edge_types[i]].send(__message, nfeat_list=[('h', features[i])])
+        out = gw[edge_types[i]].recv(msg, __reduce)
+        output.append(out)
+    # list of matrix
+    return output
+```
+
+```python
+edge_types = ['click', 'buy']
+features = []
+for edge_type in edge_types:
+    features.append(gw[edge_type].node_feat['features'])
+output = message_passing(gw, edge_types, features)
+
+output = fl.concat(input=output, axis=1)
+
+output = fluid.layers.fc(output, size=4, bias_attr=False, act='relu', name='fc1')
+logits = fluid.layers.fc(output, size=1, bias_attr=False, act=None, name='fc2')
+```
+
+
+
+## Data preprocessing
+
+In this case, we implement a simple binary node classifier, using 0 and 1 to represent the two classes.
+
+```python
+y = [0, 1, 0, 1, 0, 1, 1, 0]
+label = np.array(y, dtype="float32").reshape(-1, 1)
+```
+
+
+
+## Setting up the training program
+The training process of the heterogeneous graph node classification model is the same as that of other PaddlePaddle-based models:
+* first, we build the loss function;
+* second, we create an optimizer;
+* finally, we create an executor and run the training program.
+
+```python
+node_label = fluid.layers.data("node_label", shape=[None, 1], dtype="float32", append_batch_size=False)
+
+loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=logits, label=node_label)
+loss = fluid.layers.mean(loss)
+
+adam = fluid.optimizer.Adam(learning_rate=0.01)
+adam.minimize(loss)
+
+exe = fluid.Executor(place)
+exe.run(fluid.default_startup_program())
+feed_dict = gw.to_feed(g)
+
+for epoch in range(30):
+    feed_dict['node_label'] = label
+
+    train_loss = exe.run(fluid.default_main_program(), feed=feed_dict, fetch_list=[loss], return_numpy=True)
+    print('Epoch %d | Loss: %f' % (epoch, train_loss[0]))
+```
+
+
+
+
+
diff --git a/examples/GATNE/Dataset.py b/examples/GATNE/Dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..854e1743fc144b04ee5cf2ef62267a8bef65641c
--- /dev/null
+++ b/examples/GATNE/Dataset.py
@@ -0,0 +1,257 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file loads and preprocesses the dataset for GATNE model. +""" + +import sys +import os +import tqdm +import numpy as np +import logging +import random +from pgl.contrib import heter_graph +import pickle as pkl + + +class Dataset(object): + """Implementation of Dataset class + + This is a simple implementation of loading and processing dataset for GATNE model. + + Args: + config: dict, some configure parameters. + """ + + def __init__(self, config): + self.train_edges_file = config['data_path'] + 'train.txt' + self.valid_edges_file = config['data_path'] + 'valid.txt' + self.test_edges_file = config['data_path'] + 'test.txt' + self.nodes_file = config['data_path'] + 'nodes.txt' + + self.config = config + + self.word2index = self.load_word2index() + + self.build_graph() + + self.valid_data = self.load_test_data(self.valid_edges_file) + self.test_data = self.load_test_data(self.test_edges_file) + + def build_graph(self): + """Build pgl heterogeneous graph. + """ + edge_data_by_type, all_edges, all_nodes = self.load_training_data( + self.train_edges_file, + slf_loop=self.config['slf_loop'], + symmetry_edge=self.config['symmetry_edge']) + + num_nodes = len(all_nodes) + node_features = { + 'index': np.array( + [i for i in range(num_nodes)], dtype=np.int64).reshape(-1, 1) + } + + self.graph = heter_graph.HeterGraph( + num_nodes=num_nodes, + edges=edge_data_by_type, + node_types=None, + node_feat=node_features) + + self.edge_types = sorted(self.graph.edge_types_info()) + logging.info('total %d nodes are loaded' % (self.graph.num_nodes)) + + def load_training_data(self, file_, slf_loop=True, symmetry_edge=True): + """Load train data from file and preprocess them. + + Args: + file_: str, file name for loading data + slf_loop: bool, if true, add self loop edge for every node + symmetry_edge: bool, if true, add symmetry edge for every edge + + """ + logging.info('loading data from %s' % file_) + edge_data_by_type = dict() + all_edges = list() + all_nodes = list() + + with open(file_, 'r') as reader: + for line in reader: + words = line.strip().split(' ') + if words[0] not in edge_data_by_type: + edge_data_by_type[words[0]] = [] + src, dst = words[1], words[2] + edge_data_by_type[words[0]].append((src, dst)) + all_edges.append((src, dst)) + all_nodes.append(src) + all_nodes.append(dst) + + if symmetry_edge: + edge_data_by_type[words[0]].append((dst, src)) + all_edges.append((dst, src)) + + all_nodes = list(set(all_nodes)) + all_edges = list(set(all_edges)) + # edge_data_by_type['Base'] = all_edges + + if slf_loop: + for e_type in edge_data_by_type.keys(): + for n in all_nodes: + edge_data_by_type[e_type].append((n, n)) + + # remapping to index + edges_by_type = {} + for edge_type, edges in edge_data_by_type.items(): + res_edges = [] + for edge in edges: + res_edges.append( + (self.word2index[edge[0]], self.word2index[edge[1]])) + edges_by_type[edge_type] = res_edges + + return edges_by_type, all_edges, all_nodes + + def load_test_data(self, file_): + """Load testing data from file. 
+ """ + logging.info('loading data from %s' % file_) + + true_edge_data_by_type = {} + fake_edge_data_by_type = {} + with open(file_, 'r') as reader: + for line in reader: + words = line.strip().split(' ') + src, dst = self.word2index[words[1]], self.word2index[words[2]] + e_type = words[0] + if int(words[3]) == 1: # true edges + if e_type not in true_edge_data_by_type: + true_edge_data_by_type[e_type] = list() + true_edge_data_by_type[e_type].append((src, dst)) + else: # fake edges + if e_type not in fake_edge_data_by_type: + fake_edge_data_by_type[e_type] = list() + fake_edge_data_by_type[e_type].append((src, dst)) + + return (true_edge_data_by_type, fake_edge_data_by_type) + + def load_word2index(self): + """Load words(nodes) from file and map to index. + """ + word2index = {} + with open(self.nodes_file, 'r') as reader: + for index, line in enumerate(reader): + node = line.strip() + word2index[node] = index + + return word2index + + def generate_walks(self): + """Generate random walks for every edge type. + """ + all_walks = {} + for e_type in self.edge_types: + layer_walks = self.simulate_walks( + edge_type=e_type, + num_walks=self.config['num_walks'], + walk_length=self.config['walk_length']) + + all_walks[e_type] = layer_walks + + return all_walks + + def simulate_walks(self, edge_type, num_walks, walk_length, schema=None): + """Generate random walks in specified edge type. + """ + walks = [] + nodes = list(range(0, self.graph[edge_type].num_nodes)) + + for walk_iter in tqdm.tqdm(range(num_walks)): + random.shuffle(nodes) + for node in nodes: + walk = self.graph[edge_type].random_walk( + [node], max_depth=walk_length - 1) + for i in range(len(walk)): + walks.append(walk[i]) + + return walks + + def generate_pairs(self, all_walks): + """Generate word pairs for training. + """ + logging.info(['edge_types before generate pairs', self.edge_types]) + + pairs = [] + skip_window = self.config['win_size'] // 2 + for layer_id, e_type in enumerate(self.edge_types): + walks = all_walks[e_type] + for walk in tqdm.tqdm(walks): + for i in range(len(walk)): + for j in range(1, skip_window + 1): + if i - j >= 0 and walk[i] != walk[i - j]: + neg_nodes = self.graph[e_type].sample_nodes( + self.config['neg_num']) + pairs.append( + (walk[i], walk[i - j], *neg_nodes, layer_id)) + if i + j < len(walk) and walk[i] != walk[i + j]: + neg_nodes = self.graph[e_type].sample_nodes( + self.config['neg_num']) + pairs.append( + (walk[i], walk[i + j], *neg_nodes, layer_id)) + return pairs + + def fetch_batch(self, pairs, batch_size, for_test=False): + """Produce batch pairs data for training. 
+        """
+        np.random.shuffle(pairs)
+        n_batches = (len(pairs) + (batch_size - 1)) // batch_size
+        neg_num = len(pairs[0]) - 3
+
+        result = []
+        # n_batches is a ceiling, so iterate up to and including the last
+        # (possibly partial) batch instead of dropping it
+        for i in range(1, n_batches + 1):
+            batch_pairs = np.array(
+                pairs[batch_size * (i - 1):batch_size * i], dtype=np.int64)
+            x = batch_pairs[:, 0].reshape(-1, ).astype(np.int64)
+            y = batch_pairs[:, 1].reshape(-1, 1, 1).astype(np.int64)
+            neg = batch_pairs[:, 2:2 + neg_num].reshape(-1, neg_num,
+                                                        1).astype(np.int64)
+            t = batch_pairs[:, -1].reshape(-1, 1).astype(np.int64)
+            result.append((x, y, neg, t))
+        return result
+
+
+if __name__ == "__main__":
+    config = {
+        'data_path': './data/youtube/',
+        'train_pairs_file': 'train_pairs.pkl',
+        'slf_loop': True,
+        'symmetry_edge': True,
+        'num_walks': 20,
+        'walk_length': 10,
+        'win_size': 5,
+        'neg_num': 5,
+    }
+
+    log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
+    logging.basicConfig(level='INFO', format=log_format)
+
+    dataset = Dataset(config)
+
+    logging.info('generating walks')
+    all_walks = dataset.generate_walks()
+    logging.info('finished generating walks')
+    logging.info(['length of all walks: ', all_walks.keys()])
+
+    train_pairs = dataset.generate_pairs(all_walks)
+    pkl.dump(train_pairs,
+             open(config['data_path'] + config['train_pairs_file'], 'wb'))
+    logging.info('finished generating train_pairs')
diff --git a/examples/GATNE/README.md b/examples/GATNE/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..173410595e1cbf337bb2881f26dcb3a7bec6e851
--- /dev/null
+++ b/examples/GATNE/README.md
@@ -0,0 +1,47 @@
+# GATNE: General Attributed Multiplex Heterogeneous Network Embedding
+[GATNE](https://arxiv.org/pdf/1905.01669.pdf) is an algorithmic framework for embedding large-scale Attributed Multiplex Heterogeneous Networks (AMHN). Given a heterogeneous graph, which consists of nodes and edges of multiple types, it can learn continuous feature representations for every node. Based on PGL, we reproduce the GATNE algorithm.
+
+## Datasets
+The YouTube dataset contains 2,000 nodes, 1,310,617 edges and 5 edge types. We use the YouTube dataset as an example.
+
+You can download the YouTube dataset from [here](https://github.com/THUDM/GATNE/tree/master/data).
+
+After downloading the data, put it, let's say, in ./data/ . Note that the current directory is the root directory of the GATNE model. The ./data/youtube/ directory then contains three files:
+* train.txt
+* valid.txt
+* test.txt
+
+You can then run the command below to preprocess the data.
+```sh
+python data_process.py --input_file ./data/youtube/train.txt --output_file ./data/youtube/nodes.txt
+```
+
+## Dependencies
+- paddlepaddle>=1.6
+- pgl>=1.0.0
+
+## Hyperparameters
+All the hyperparameters are saved in the config.yaml file, so before training the GATNE model you can open config.yaml and modify the hyperparameters as you like.
+
+For example, you can set "use_cuda" to "True" in order to use a GPU for training, or modify "data_path" to use a different dataset.
+
+Some important hyperparameters in config.yaml:
+- use_cuda: use a GPU to train the model
+- data_path: the directory of the dataset
+- lr: learning rate
+- neg_num: number of negative samples
+- num_walks: number of walks started from each node +- walk_length: walk length + +## How to run +Then run the below command: +```sh +python main.py -c config.yaml +``` + +### Experiment results +| | PGL result | Reported result | +|:---:|------------|-----------------| +| AUC | 84.83 | 84.61 | +| PR | 82.77 | 81.93 | +| F1 | 76.98 | 76.83 | diff --git a/examples/GATNE/config.yaml b/examples/GATNE/config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..a11fcb0036a45773036265a7c1eb2895a7473dd6 --- /dev/null +++ b/examples/GATNE/config.yaml @@ -0,0 +1,38 @@ +task_name: train.gatne +use_cuda: True +log_level: info +seed: 1667 + +optimizer: + type: + args: + lr: 0.005 + +trainer: + type: trainer + args: + epochs: 2 + log_dir: logs/ + save_dir: checkpoints/ + output_dir: outputs/ + +data_loader: + type: Dataset + args: + data_path: ./data/youtube/ + train_pairs_file: train_pairs.pkl + batch_size: 256 + num_walks: 20 + walk_length: 10 + win_size: 5 + neg_num: 5 + slf_loop: True + symmetry_edge: True + +model: + type: GATNE + args: + dimensions: 200 + edge_dim: 32 + att_dim: 32 + att_head: 1 diff --git a/examples/GATNE/data_process.py b/examples/GATNE/data_process.py new file mode 100644 index 0000000000000000000000000000000000000000..e5e411e0c329538e77d4aa0c60392ceb2be491de --- /dev/null +++ b/examples/GATNE/data_process.py @@ -0,0 +1,58 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file preprocess the data before training. +""" + +import sys +import argparse + + +def gen_nodes_file(file_, result_file): + """calculate the total number of nodes and save them for latter processing. + """ + nodes = [] + with open(file_, 'r') as reader: + for line in reader: + tokens = line.strip().split(' ') + nodes.append(tokens[1]) + nodes.append(tokens[2]) + + nodes = list(set(nodes)) + nodes.sort(key=int) + print('total number of nodes: %d' % len(nodes)) + print('saving nodes file in %s' % (result_file)) + with open(result_file, 'w') as writer: + for n in nodes: + writer.write(n + '\n') + + print('finished') + + +if __name__ == "__main__": + parser = argparse.ArgumentParser(description='GATNE') + parser.add_argument( + '--input_file', + default='./data/youtube/train.txt', + type=str, + help='input file') + parser.add_argument( + '--output_file', + default='./data/youtube/nodes.txt', + type=str, + help='output file') + args = parser.parse_args() + + print('generating nodes file') + gen_nodes_file(args.input_file, args.output_file) diff --git a/examples/GATNE/main.py b/examples/GATNE/main.py new file mode 100644 index 0000000000000000000000000000000000000000..02acd0fe740dd12c184e20e7a2c2a83458f3c568 --- /dev/null +++ b/examples/GATNE/main.py @@ -0,0 +1,307 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+This file implements the training process of the GATNE model.
+"""
+
+import os
+import argparse
+import time
+import numpy as np
+import logging
+import pickle as pkl
+
+import pgl
+from pgl.utils import paddle_helper
+import paddle
+import paddle.fluid as fluid
+import paddle.fluid.layers as fl
+
+from utils import *
+import Dataset
+import model as Model
+from sklearn.metrics import (auc, f1_score, precision_recall_curve,
+                             roc_auc_score)
+
+
+def set_seed(seed):
+    """Set random seed.
+    """
+    random.seed(seed)
+    np.random.seed(seed)
+
+
+def produce_model(exe, program, dataset, model, feed_dict):
+    """Output the learned model parameters for testing.
+    """
+    edge_types = dataset.edge_types
+    num_nodes = dataset.graph[edge_types[0]].num_nodes
+    edge_types_count = len(edge_types)
+    neg_num = dataset.config['neg_num']
+
+    final_model = {}
+    feed_dict['train_inputs'] = np.array(
+        [n for n in range(num_nodes)], dtype=np.int64).reshape(-1, )
+    feed_dict['train_labels'] = np.array(
+        [n for n in range(num_nodes)], dtype=np.int64).reshape(-1, 1, 1)
+    feed_dict['train_negs'] = np.tile(feed_dict['train_labels'],
+                                      (1, neg_num)).reshape(-1, neg_num, 1)
+
+    for i in range(edge_types_count):
+        feed_dict['train_types'] = np.array(
+            [i for _ in range(num_nodes)], dtype=np.int64).reshape(-1, 1)
+        edge_node_embed = exe.run(program,
+                                  feed=feed_dict,
+                                  fetch_list=[model.last_node_embed],
+                                  return_numpy=True)[0]
+        final_model[edge_types[i]] = edge_node_embed
+
+    return final_model
+
+
+def evaluate(final_model, edge_types, data):
+    """Calculate the AUC score, F1 score and PR score of the final model.
+    """
+    edge_types_count = len(edge_types)
+    AUC, F1, PR = [], [], []
+
+    true_edge_data_by_type = data[0]
+    fake_edge_data_by_type = data[1]
+
+    for i in range(edge_types_count):
+        try:
+            local_model = final_model[edge_types[i]]
+            true_edges = true_edge_data_by_type[edge_types[i]]
+            fake_edges = fake_edge_data_by_type[edge_types[i]]
+        except Exception as e:
+            logging.warning('edge type does not exist. 
%s' % str(e))
+            continue
+        tmp_auc, tmp_f1, tmp_pr = calculate_score(local_model, true_edges,
+                                                  fake_edges)
+        AUC.append(tmp_auc)
+        F1.append(tmp_f1)
+        PR.append(tmp_pr)
+
+    return {'AUC': np.mean(AUC), 'F1': np.mean(F1), 'PR': np.mean(PR)}
+
+
+def calculate_score(model, true_edges, fake_edges):
+    """Calculate the AUC score, F1 score and PR score of a specified edge type.
+    """
+    true_list = list()
+    prediction_list = list()
+    true_num = 0
+    for edge in true_edges:
+        tmp_score = get_score(model, edge)
+        if tmp_score is not None:
+            true_list.append(1)
+            prediction_list.append(tmp_score)
+            true_num += 1
+
+    for edge in fake_edges:
+        tmp_score = get_score(model, edge)
+        if tmp_score is not None:
+            true_list.append(0)
+            prediction_list.append(tmp_score)
+
+    sorted_pred = prediction_list[:]
+    sorted_pred.sort()
+    # use the true_num-th largest score as the decision threshold
+    threshold = sorted_pred[-true_num]
+
+    y_pred = np.zeros(len(prediction_list), dtype=np.int32)
+    for i in range(len(prediction_list)):
+        if prediction_list[i] >= threshold:
+            y_pred[i] = 1
+
+    y_true = np.array(true_list)
+    y_scores = np.array(prediction_list)
+    ps, rs, _ = precision_recall_curve(y_true, y_scores)
+    return roc_auc_score(y_true, y_scores), f1_score(y_true, y_pred), auc(rs,
+                                                                          ps)
+
+
+def get_score(local_model, edge):
+    """Calculate the cosine similarity score between two nodes.
+    """
+    try:
+        vector1 = local_model[edge[0]]
+        vector2 = local_model[edge[1]]
+        return np.dot(vector1, vector2) / (np.linalg.norm(vector1) *
+                                           np.linalg.norm(vector2))
+    except Exception as e:
+        logging.warning('get_score warning: %s' % str(e))
+        return None
+
+
+def run_epoch(epoch,
+              config,
+              dataset,
+              data,
+              train_prog,
+              test_prog,
+              model,
+              feed_dict,
+              exe,
+              for_test=False):
+    """Run the training process of every epoch.
+    """
+    total_loss = []
+    avg_loss = 0.0  # keep a defined value even if fewer than 500 steps run
+    for idx, batch_data in enumerate(data):
+        feed_dict['train_inputs'] = batch_data[0]
+        feed_dict['train_labels'] = batch_data[1]
+        feed_dict['train_negs'] = batch_data[2]
+        feed_dict['train_types'] = batch_data[3]
+
+        loss, lr = exe.run(train_prog,
+                           feed=feed_dict,
+                           fetch_list=[model.loss, model.lr],
+                           return_numpy=True)
+        total_loss.append(loss[0])
+
+        if (idx + 1) % 500 == 0:
+            avg_loss = np.mean(total_loss)
+            logging.info("epoch %d | step %d | lr %.4f | train_loss %f " %
+                         (epoch, idx + 1, lr, avg_loss))
+            total_loss = []
+
+    return avg_loss
+
+
+def save_model(program, exe, dataset, model, feed_dict, filename):
+    """Save model.
+    """
+    final_model = produce_model(exe, program, dataset, model, feed_dict)
+    logging.info('saving model in %s' % (filename))
+    pkl.dump(final_model, open(filename, 'wb'))
+
+
+def test(program, exe, dataset, model, feed_dict):
+    """Testing and validating.
+    """
+    final_model = produce_model(exe, program, dataset, model, feed_dict)
+    valid_result = evaluate(final_model, dataset.edge_types,
+                            dataset.valid_data)
+    test_result = evaluate(final_model, dataset.edge_types, dataset.test_data)
+
+    logging.info("valid_AUC %.4f | valid_PR %.4f | valid_F1 %.4f" %
+                 (valid_result['AUC'], valid_result['PR'], valid_result['F1']))
+    logging.info("test_AUC %.4f | test_PR %.4f | test_F1 %.4f" %
+                 (test_result['AUC'], test_result['PR'], test_result['F1']))
+
+    return test_result
+
+
+def main(config):
+    """Main function for training the GATNE model.
+    """
+    logging.info(config)
+
+    set_seed(config['seed'])
+
+    dataset = getattr(
+        Dataset, config['data_loader']['type'])(config['data_loader']['args'])
+    edge_types = dataset.graph.edge_types_info()
+    logging.info(['total edge types: ', edge_types])
+
+    # train_pairs is a list of tuples: [(src1, dst1, neg, e1), (src2, dst2, neg, e2)]
+    # e (int) is the edge type id, used to select the edge embedding
+    train_pairs_file = config['data_loader']['args']['data_path'] + \
+        config['data_loader']['args']['train_pairs_file']
+    if os.path.exists(train_pairs_file):
+        logging.info('loading train pairs from pkl file %s' % train_pairs_file)
+        train_pairs = pkl.load(open(train_pairs_file, 'rb'))
+    else:
+        logging.info('generating walks')
+        all_walks = dataset.generate_walks()
+        logging.info('generating train pairs')
+        train_pairs = dataset.generate_pairs(all_walks)
+        logging.info('dumping train pairs to %s' % (train_pairs_file))
+        pkl.dump(train_pairs, open(train_pairs_file, 'wb'))
+
+    logging.info('total train pairs: %d' % (len(train_pairs)))
+    data = dataset.fetch_batch(train_pairs,
+                               config['data_loader']['args']['batch_size'])
+
+    place = fluid.CUDAPlace(0) if config['use_cuda'] else fluid.CPUPlace()
+    train_program = fluid.Program()
+    startup_program = fluid.Program()
+    test_program = fluid.Program()
+
+    with fluid.program_guard(train_program, startup_program):
+        model = getattr(Model, config['model']['type'])(
+            config['model']['args'], dataset, place)
+
+    test_program = train_program.clone(for_test=True)
+    with fluid.program_guard(train_program, startup_program):
+        global_steps = len(data) * config['trainer']['args']['epochs']
+        model.backward(global_steps, config['optimizer']['args'])
+
+    # train
+    exe = fluid.Executor(place)
+    exe.run(startup_program)
+    feed_dict = model.gw.to_feed(dataset.graph)
+
+    logging.info('test before training...')
+    test(test_program, exe, dataset, model, feed_dict)
+    logging.info('training...')
+    for epoch in range(1, 1 + config['trainer']['args']['epochs']):
+        train_result = run_epoch(epoch, config['trainer']['args'], dataset,
+                                 data, train_program, test_program, model,
+                                 feed_dict, exe)
+
+        logging.info('validating and testing...')
+        test_result = test(test_program, exe, dataset, model, feed_dict)
+
+        filename = os.path.join(config['trainer']['args']['save_dir'],
+                                'dict_embed_model_epoch_%d.pkl' % (epoch))
+        save_model(test_program, exe, dataset, model, feed_dict, filename)
+
+    logging.info(
+        "final_test_AUC %.4f | final_test_PR %.4f | final_test_F1 %.4f" % (
+            test_result['AUC'], test_result['PR'], test_result['F1']))
+
+    logging.info('training finished')
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='GATNE')
+    parser.add_argument(
+        '-c',
+        '--config',
+        default=None,
+        type=str,
+        help='config file path (default: None)')
+    parser.add_argument(
+        '-n',
+        '--taskname',
+        default=None,
+        type=str,
+        help='task name (default: None)')
+    args = parser.parse_args()
+
+    if args.config:
+        # load config file
+        config = Config(args.config, isCreate=True, isSave=True)
+        config = config()
+    else:
+        raise AssertionError(
+            "A configuration file needs to be specified. Add '-c config.yaml', for example."
+ ) + + log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s' + logging.basicConfig( + level=getattr(logging, config['log_level'].upper()), format=log_format) + + main(config) diff --git a/examples/GATNE/model.py b/examples/GATNE/model.py new file mode 100644 index 0000000000000000000000000000000000000000..73775e802fd76f5f68a6d15a24c8decd3fe3654d --- /dev/null +++ b/examples/GATNE/model.py @@ -0,0 +1,245 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file implement the GATNE model. +""" + +import numpy as np +import math +import logging + +import paddle.fluid as fluid +import paddle.fluid.layers as fl +from pgl.contrib import heter_graph_wrapper + + +class GATNE(object): + """Implementation of GATNE model. + + Args: + config: dict, some configure parameters. + dataset: instance of Dataset class + place: GPU or CPU place + """ + + def __init__(self, config, dataset, place): + logging.info(['model is: ', self.__class__.__name__]) + self.config = config + self.graph = dataset.graph + self.placce = place + self.edge_types = sorted(self.graph.edge_types_info()) + logging.info('edge_types in model: %s' % str(self.edge_types)) + neg_num = dataset.config['neg_num'] + + # hyper parameters + self.num_nodes = self.graph.num_nodes + self.embedding_size = self.config['dimensions'] + self.embedding_u_size = self.config['edge_dim'] + self.dim_a = self.config['att_dim'] + self.att_head = self.config['att_head'] + self.edge_type_count = len(self.edge_types) + self.u_num = self.edge_type_count + + self.gw = heter_graph_wrapper.HeterGraphWrapper( + name="heter_graph", + place=place, + edge_types=self.graph.edge_types_info(), + node_feat=self.graph.node_feat_info(), + edge_feat=self.graph.edge_feat_info()) + + self.train_inputs = fl.data( + 'train_inputs', shape=[None], dtype='int64') + + self.train_labels = fl.data( + 'train_labels', shape=[None, 1, 1], dtype='int64') + + self.train_types = fl.data( + 'train_types', shape=[None, 1], dtype='int64') + + self.train_negs = fl.data( + 'train_negs', shape=[None, neg_num, 1], dtype='int64') + + self.forward() + + def forward(self): + """Build the GATNE net. 
+ """ + param_attr_init = fluid.initializer.Uniform( + low=-1.0, high=1.0, seed=np.random.randint(100)) + embed_param_attrs = fluid.ParamAttr( + name='Base_node_embed', initializer=param_attr_init) + + # node_embeddings + base_node_embed = fl.embedding( + input=fl.reshape( + self.train_inputs, shape=[-1, 1]), + size=[self.num_nodes, self.embedding_size], + param_attr=embed_param_attrs) + + node_features = [] + for edge_type in self.edge_types: + param_attr_init = fluid.initializer.Uniform( + low=-1.0, high=1.0, seed=np.random.randint(100)) + embed_param_attrs = fluid.ParamAttr( + name='%s_node_embed' % edge_type, initializer=param_attr_init) + + features = fl.embedding( + input=self.gw[edge_type].node_feat['index'], + size=[self.num_nodes, self.embedding_u_size], + param_attr=embed_param_attrs) + + node_features.append(features) + + # mp_output: list of embedding(self.num_nodes, dim) + mp_output = self.message_passing(self.gw, self.edge_types, + node_features) + + # U : (num_type[m], num_nodes, dim[s]) + node_type_embed = fl.stack(mp_output, axis=0) + + # U : (num_nodes, num_type[m], dim[s]) + node_type_embed = fl.transpose(node_type_embed, perm=[1, 0, 2]) + + #gather node_type_embed from train_inputs + node_type_embed = fl.gather(node_type_embed, self.train_inputs) + + # M_r + trans_weights = fl.create_parameter( + shape=[ + self.edge_type_count, self.embedding_u_size, + self.embedding_size // self.att_head + ], + attr=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)), + dtype='float32', + name='trans_w') + + # W_r + trans_weights_s1 = fl.create_parameter( + shape=[self.edge_type_count, self.embedding_u_size, self.dim_a], + attr=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)), + dtype='float32', + name='trans_w_s1') + + # w_r + trans_weights_s2 = fl.create_parameter( + shape=[self.edge_type_count, self.dim_a, self.att_head], + attr=fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)), + dtype='float32', + name='trans_w_s2') + + trans_w = fl.gather(trans_weights, self.train_types) + trans_w_s1 = fl.gather(trans_weights_s1, self.train_types) + trans_w_s2 = fl.gather(trans_weights_s2, self.train_types) + + attention = self.attention(node_type_embed, trans_w_s1, trans_w_s2) + node_type_embed = fl.matmul(attention, node_type_embed) + node_embed = base_node_embed + fl.reshape( + fl.matmul(node_type_embed, trans_w), [-1, self.embedding_size]) + + self.last_node_embed = fl.l2_normalize(node_embed, axis=1) + + nce_weight_initializer = fluid.initializer.TruncatedNormalInitializer( + loc=0.0, scale=1.0 / math.sqrt(self.embedding_size)) + nce_weight_attrs = fluid.ParamAttr( + name='nce_weight', initializer=nce_weight_initializer) + + weight_pos = fl.embedding( + input=self.train_labels, + size=[self.num_nodes, self.embedding_size], + param_attr=nce_weight_attrs) + weight_neg = fl.embedding( + input=self.train_negs, + size=[self.num_nodes, self.embedding_size], + param_attr=nce_weight_attrs) + tmp_node_embed = fl.unsqueeze(self.last_node_embed, axes=[1]) + pos_logits = fl.matmul( + tmp_node_embed, weight_pos, transpose_y=True) # [B, 1, 1] + + neg_logits = fl.matmul( + tmp_node_embed, weight_neg, transpose_y=True) # [B, 1, neg_num] + + pos_score = fl.squeeze(pos_logits, axes=[1]) + pos_score = fl.clip(pos_score, min=-10, max=10) + pos_score = -1.0 * fl.logsigmoid(pos_score) + + neg_score = fl.squeeze(neg_logits, axes=[1]) + neg_score = 
fl.clip(neg_score, min=-10, max=10)
+        neg_score = -1.0 * fl.logsigmoid(-1.0 * neg_score)
+
+        neg_score = fl.reduce_sum(neg_score, dim=1, keep_dim=True)
+        self.loss = fl.reduce_mean(pos_score + neg_score)
+
+    def attention(self, node_type_embed, trans_w_s1, trans_w_s2):
+        """Calculate attention weights.
+        """
+        attention = fl.tanh(fl.matmul(node_type_embed, trans_w_s1))
+        attention = fl.matmul(attention, trans_w_s2)
+        attention = fl.reshape(attention, [-1, self.u_num])
+        attention = fl.softmax(attention)
+        attention = fl.reshape(attention, [-1, self.att_head, self.u_num])
+        return attention
+
+    def message_passing(self, gw, edge_types, features, name=''):
+        """Message passing from source nodes to destination nodes.
+        """
+
+        def __message(src_feat, dst_feat, edge_feat):
+            """send function
+            """
+            return src_feat['h']
+
+        def __reduce(feat):
+            """recv function
+            """
+            return fluid.layers.sequence_pool(feat, pool_type='average')
+
+        if not isinstance(edge_types, list):
+            edge_types = [edge_types]
+
+        if not isinstance(features, list):
+            features = [features]
+
+        assert len(edge_types) == len(features)
+
+        output = []
+        for i in range(len(edge_types)):
+            msg = gw[edge_types[i]].send(
+                __message, nfeat_list=[('h', features[i])])
+            neigh_feat = gw[edge_types[i]].recv(msg, __reduce)
+            neigh_feat = fl.fc(neigh_feat,
+                               size=neigh_feat.shape[-1],
+                               name='neigh_fc_%d' % (i),
+                               act='sigmoid')
+            slf_feat = fl.fc(features[i],
+                             size=neigh_feat.shape[-1],
+                             name='slf_fc_%d' % (i),
+                             act='sigmoid')
+
+            out = fluid.layers.concat([slf_feat, neigh_feat], axis=1)
+            out = fl.fc(out, size=neigh_feat.shape[-1], name='fc', act=None)
+            out = fluid.layers.l2_normalize(out, axis=1)
+            output.append(out)
+
+        # list of matrix
+        return output
+
+    def backward(self, global_steps, opt_config):
+        """Build the optimizer.
+        """
+        self.lr = fl.polynomial_decay(opt_config['lr'], global_steps, 0.001)
+        adam = fluid.optimizer.Adam(learning_rate=self.lr)
+        adam.minimize(self.loss)
diff --git a/examples/GATNE/utils.py b/examples/GATNE/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..c83a904177819768f4c11c32f2b8515c02f3cefc
--- /dev/null
+++ b/examples/GATNE/utils.py
@@ -0,0 +1,94 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+This file implements a class for model configuration.
+"""
+
+import datetime
+import os
+import yaml
+import random
+import shutil
+
+
+class Config(object):
+    """Implementation of the Config class for model configuration.
+
+    Args:
+        config_file(str): configure filename, which is a yaml file.
+        isCreate(bool): if true, create some necessary directories to save models, log file and other outputs.
+        isSave(bool): if true, save config_file in order to record the configuration.
+    """
+
+    def __init__(self, config_file, isCreate=False, isSave=False):
+        self.config_file = config_file
+        self.config = self.get_config_from_yaml(config_file)
+
+        if isCreate:
+            self.create_necessary_dirs()
+
+        if isSave:
+            self.save_config_file()
+
+    def get_config_from_yaml(self, yaml_file):
+        """Get the configuration hyperparameters from a yaml file.
+        """
+        try:
+            with open(yaml_file, 'r') as f:
+                # use safe_load to avoid executing arbitrary yaml tags
+                config = yaml.safe_load(f)
+        except Exception:
+            raise IOError("Error in parsing config file '%s'" % yaml_file)
+
+        return config
+
+    def create_necessary_dirs(self):
+        """Create some necessary directories to save some important files.
+        """
+
+        time_stamp = datetime.datetime.now().strftime('%m%d_%H%M')
+        self.config['trainer']['args']['log_dir'] = ''.join(
+            (self.config['trainer']['args']['log_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        self.config['trainer']['args']['save_dir'] = ''.join(
+            (self.config['trainer']['args']['save_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        self.config['trainer']['args']['output_dir'] = ''.join(
+            (self.config['trainer']['args']['output_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        # if os.path.exists(self.config['trainer']['args']['save_dir']):
+        #     input('save_dir is existed, do you really want to continue?')
+
+        self.make_dir(self.config['trainer']['args']['log_dir'])
+        self.make_dir(self.config['trainer']['args']['save_dir'])
+        self.make_dir(self.config['trainer']['args']['output_dir'])
+
+    def save_config_file(self):
+        """Save the config file so that we can check the configuration later.
+        """
+        filename = self.config_file.split('/')[-1]
+        targetpath = self.config['trainer']['args']['save_dir']
+        shutil.copyfile(self.config_file, targetpath + filename)
+
+    def make_dir(self, path):
+        """Build directory"""
+        if not os.path.exists(path):
+            os.makedirs(path)
+
+    def __getitem__(self, key):
+        """Return the configure dict"""
+        return self.config[key]
+
+    def __call__(self):
+        """__call__"""
+        return self.config
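+
+
+# Illustrative usage only (assumed from main.py, not part of the original
+# file): isCreate builds the log/save/output directories and isSave copies
+# the yaml into save_dir for reproducibility.
+#
+#     config = Config('config.yaml', isCreate=True, isSave=True)
+#     lr = config['optimizer']['args']['lr']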
diff --git a/examples/dgi/README.md b/examples/dgi/README.md
index 34481e06d4576c7969fe68c7a7b6a5f5e319d0c9..40df5dc5ffa5ab9df77cbfcab6bef32ac44575e1 100644
--- a/examples/dgi/README.md
+++ b/examples/dgi/README.md
@@ -1,4 +1,4 @@
-# PGL Examples for DGI
+# DGI: Deep Graph Infomax
 [Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
diff --git a/examples/distribute_deepwalk/README.md b/examples/distribute_deepwalk/README.md
index 24775aaae8bfbae11c69a68fa6eb49c9863a0ae3..c0c71fb45935093a34d24431d5db01f9d2d16a8d 100644
--- a/examples/distribute_deepwalk/README.md
+++ b/examples/distribute_deepwalk/README.md
@@ -1,5 +1,6 @@
-# PGL Examples for distributed deepwalk
+# Distributed Deepwalk in PGL
 [Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce distributed deepwalk algorithms and reach the same level of indicators as the paper.
+
 ## Datasets
 The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3).
 ## Dependencies
@@ -8,7 +9,9 @@ The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/
 
 ## How to run
 
-For examples, train deepwalk in distributed mode on cora dataset.
+We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```pgl_deepwalk.cfg``` is the config file for deepwalk hyperparameters and ```local_config``` is the config file for parameter servers. By default, we have 2 pservers and 2 trainers, and ```cloud_run.sh``` can help you start up the parameter servers and model trainers.
+
+For example, train deepwalk in distributed mode on the BlogCatalog dataset.
 ```sh
 # train deepwalk in distributed mode.
 sh cloud_run.sh
diff --git a/examples/distribute_graphsage/README.md b/examples/distribute_graphsage/README.md
index 7e424d4668955476eecf459ade84603b71e26ebe..0ce196f6417b676f8d1853f14c012bd86d5972ef 100644
--- a/examples/distribute_graphsage/README.md
+++ b/examples/distribute_graphsage/README.md
@@ -55,3 +55,5 @@ python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_w
 - samples_1: The max neighbors for the first hop neighbor sampling. (default: 25)
 - samples_2: The max neighbors for the second hop neighbor sampling. (default: 10)
 - hidden_size: The hidden size of the GraphSAGE models.
+
+
diff --git a/examples/distribute_graphsage/reader.py b/examples/distribute_graphsage/reader.py
index 9c230cce3c40161881163a0446ab51977ecc9700..6617b6b86fe08facee1915edcd459a8c706c4191 100644
--- a/examples/distribute_graphsage/reader.py
+++ b/examples/distribute_graphsage/reader.py
@@ -89,8 +89,7 @@ def worker(batch_info, graph_wrapper, samples):
             if len(start_nodes) == 0:
                 break
-            subgraph = graph.subgraph(
-                nodes=nodes, eid=eids, edges=[eid2edges[e] for e in eids])
+            subgraph = graph.subgraph(nodes=nodes, eid=eids, edges=[ eid2edges[e] for e in eids])
             sub_node_index = subgraph.reindex_from_parrent_nodes(
                 batch_train_samples)
             feed_dict = graph_wrapper.to_feed(subgraph)
@@ -103,7 +102,8 @@ def worker(batch_info, graph_wrapper, samples):
     return work
 
-def multiprocess_graph_reader(graph_wrapper,
+def multiprocess_graph_reader(
+        graph_wrapper,
                               samples,
                               node_index,
                               batch_size,
@@ -138,7 +138,7 @@ def multiprocess_graph_reader(graph_wrapper,
     reader_pool = []
     for i in range(num_workers):
         reader_pool.append(
-            worker(batch_info[block_size * i:block_size * (i + 1)],
+            worker(batch_info[block_size * i:block_size * (i + 1)], 
                    graph_wrapper, samples))
     multi_process_sample = mp_reader.multiprocess_reader(
         reader_pool, use_pipe=True, queue_size=1000)
@@ -146,3 +146,4 @@ def multiprocess_graph_reader(graph_wrapper,
     return reader()
 
+
diff --git a/examples/distribute_graphsage/requirements.txt b/examples/distribute_graphsage/requirements.txt
index bfc094c11eb1cebe90f8acbc4c399eb688d9c7cb..7bda67a20635218a8786cfb872cfd2da5b2ddbe1 100644
--- a/examples/distribute_graphsage/requirements.txt
+++ b/examples/distribute_graphsage/requirements.txt
@@ -1,3 +1,4 @@
 scipy
 redis==2.10.6
 redis-py-cluster==1.3.6
+
diff --git a/examples/distribute_graphsage/train.py b/examples/distribute_graphsage/train.py
index cb62acf8c20b1e63126c6ff20687d6e59f597d7f..fa52e3e002b52e14db5ea4e893377476eada41ef 100644
--- a/examples/distribute_graphsage/train.py
+++ b/examples/distribute_graphsage/train.py
@@ -170,9 +170,7 @@ def main(args):
 
     with fluid.program_guard(train_program, 
startup_program): graph_wrapper = pgl.graph_wrapper.GraphWrapper( - "sub_graph", - fluid.CPUPlace(), - node_feat=[('feats', [None, 602], np.dtype('float32'))]) + "sub_graph", fluid.CPUPlace(), node_feat=[('feats', [None, 602], np.dtype('float32'))]) model_loss, model_acc = build_graph_model( graph_wrapper, num_class=data["num_class"], diff --git a/examples/gat/README.md b/examples/gat/README.md index 573316ea729a03441d47f7a54fe83163142cf96c..4baefbb7a63abff0a90431a123e8f2b340ac124a 100644 --- a/examples/gat/README.md +++ b/examples/gat/README.md @@ -1,4 +1,4 @@ -# PGL Examples for GAT +# GAT: Graph Attention Networks [Graph Attention Networks \(GAT\)](https://arxiv.org/abs/1710.10903) is a novel architectures that operate on graph-structured data, which leverages masked self-attentional layers to address the shortcomings of prior methods based on graph convolutions or their approximations. Based on PGL, we reproduce GAT algorithms and reach the same level of indicators as the paper in citation network benchmarks. ### Simple example to build single head GAT diff --git a/examples/gcn/README.md b/examples/gcn/README.md index 1a1cf4fc54b3c7232b8c3b1cb0516b34de05240d..6812d4d4e71bfe027ef90e417e38ac48ab0bba47 100644 --- a/examples/gcn/README.md +++ b/examples/gcn/README.md @@ -1,4 +1,4 @@ -# PGL Examples for GCN +# GCN: Graph Convolutional Networks [Graph Convolutional Network \(GCN\)](https://arxiv.org/abs/1609.02907) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce GCN algorithms and reach the same level of indicators as the paper in citation network benchmarks. diff --git a/examples/ges/README.md b/examples/ges/README.md index 5636cb13c2eec606141b469f3b1a13e347ae8bda..0f4479750936a540e3af42faf712cbfb734b4045 100644 --- a/examples/ges/README.md +++ b/examples/ges/README.md @@ -1,4 +1,4 @@ -# PGL Examples for GES +# GES: Graph Embedding with Side Information [Graph Embedding with Side Information](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce ges algorithms. ## Datasets The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3). diff --git a/examples/graphsage/README.md b/examples/graphsage/README.md index 449aaf8c72112eb979ec916f104a0c29b06f1064..945876acb7eba5ae5c7b801fb3a69ed450362391 100644 --- a/examples/graphsage/README.md +++ b/examples/graphsage/README.md @@ -1,4 +1,4 @@ -# GraphSAGE in PGL +# GraphSAGE: Inductive Representation Learning on Large Graphs [GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce GraphSAGE algorithm and reach the same level of indicators as the paper in Reddit Dataset. Besides, this is an example of subgraph sampling and training in PGL. 
diff --git a/examples/line/README.md b/examples/line/README.md index abbe0d4e28e6ae47e72e6aeec6a7b70f04ae988d..53066e894dec4b7b2ff8e9e7eb9a0e26be253760 100644 --- a/examples/line/README.md +++ b/examples/line/README.md @@ -1,4 +1,4 @@ -# PGL Examples for LINE +# LINE: Large-scale Information Network Embedding [LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable to a variety of networks including directed, undirected, binary or weighted edges. Based on PGL, we reproduce LINE algorithms and reach the same level of indicators as the paper. ## Datasets @@ -36,7 +36,7 @@ For examples, use gpu to train LINE on Flickr dataset. # multiclass task example python line.py --use_cuda --order first_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/ -python multi_class.py --ckpt_path ./checkpoints/model/model_eopch_20 --percent 0.5 +python multi_class.py --ckpt_path ./checkpoints/model/model_epoch_20 --percent 0.5 ``` diff --git a/examples/line/line.py b/examples/line/line.py index a6e17f6d20122f806cdc27e88abdfb1585fc99a2..4c7cc9b5540c2399b376a57fe810d9af3ba0119c 100644 --- a/examples/line/line.py +++ b/examples/line/line.py @@ -42,6 +42,16 @@ def make_dir(path): raise +def save_param(dirname, var_name_list): + """save_param""" + if not os.path.exists(dirname): + os.makedirs(dirname) + for var_name in var_name_list: + var = fluid.global_scope().find_var(var_name) + var_tensor = var.get_tensor() + np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor)) + + def set_seed(seed): """Set global random seed. """ @@ -153,9 +163,9 @@ def main(args): # save parameters in every epoch log.info("saving persistables parameters...") - fluid.io.save_persistables(exe, - os.path.join(args.save_dir, "model_epoch_%d" - % (epoch + 1)), main_program) + cur_save_path = os.path.join(args.save_dir, + "model_epoch_%d" % (epoch + 1)) + save_param(cur_save_path, ['shared_w']) if __name__ == '__main__': diff --git a/examples/line/multi_class.py b/examples/line/multi_class.py index 3f191f23504359995c3a9f279c9948ff5402b15c..00e616594c58fc05e81010a6df0fe4cd9017caa7 100644 --- a/examples/line/multi_class.py +++ b/examples/line/multi_class.py @@ -33,6 +33,15 @@ from pgl.utils.logger import log from data_loader import FlickrDataset +def load_param(dirname, var_name_list): + """load_param""" + for var_name in var_name_list: + var = fluid.global_scope().find_var(var_name) + var_tensor = var.get_tensor() + var_tmp = np.load(os.path.join(dirname, var_name + '.npy')) + var_tensor.set(var_tmp, fluid.CPUPlace()) + + def set_seed(seed): """Set global random seed. 
""" @@ -200,12 +209,15 @@ def main(args): return False return os.path.exists(os.path.join(args.ckpt_path, var.name)) - fluid.io.load_vars( - exe, args.ckpt_path, main_program=train_prog, predicate=existed_params) + log.info('loading pretrained parameters from npy') + load_param(args.ckpt_path, ['shared_w']) + step = 0 prev_time = time.time() train_model['pyreader'].start() + final_macro_f1 = 0.0 + final_micro_f1 = 0.0 while 1: try: train_loss_val, train_probs_val, train_labels_val, train_topk_val = exe.run( @@ -257,8 +269,13 @@ def main(args): log.info("\t\tStep %d " % step + "Test Loss: %f " % test_loss_val + "Test Macro F1: %f " % test_macro_f1 + "Test Micro F1: %f " % test_micro_f1) + final_macro_f1 = max(test_macro_f1, final_macro_f1) + final_micro_f1 = max(test_micro_f1, final_micro_f1) break + log.info("\nFinal test Macro F1: %f " % final_macro_f1 + + "Final test Micro F1: %f " % final_micro_f1) + if __name__ == '__main__': parser = argparse.ArgumentParser(description='LINE') @@ -268,7 +285,7 @@ if __name__ == '__main__': default='./data/flickr/', help='dataset for training') parser.add_argument("--use_cuda", action='store_true', help="use_cuda") - parser.add_argument("--epochs", type=int, default=10) + parser.add_argument("--epochs", type=int, default=5) parser.add_argument("--seed", type=int, default=1667) parser.add_argument( "--lr", type=float, default=0.025, help='learning rate') diff --git a/examples/metapath2vec/Dataset.py b/examples/metapath2vec/Dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..5bf97b548f0979adf61545d89b26aeb0b1d48b75 --- /dev/null +++ b/examples/metapath2vec/Dataset.py @@ -0,0 +1,196 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +This file loads and preprocesses the dataset for metapath2vec model. +""" + +import sys +import os +import glob +import numpy as np +import tqdm +import time +import logging +import random +from pgl.contrib import heter_graph +import pickle as pkl + + +class Dataset(object): + """Implementation of Dataset class + + This is a simple implementation of loading and processing dataset for metapath2vec model. + + Args: + config: dict, some configure parameters. + """ + + NEGATIVE_TABLE_SIZE = 1e8 + + def __init__(self, config): + self.config = config + self.walk_files = config['input_path'] + config['walk_path'] + self.word2id_file = config['input_path'] + config['word2id_file'] + + self.word2freq = {} + self.word2id = {} + self.id2word = {} + self.sentences_count = 0 + self.token_count = 0 + self.negatives = [] + self.discards = [] + + logging.info('reading sentences') + self.read_words() + logging.info('initializing discards') + self.initDiscards() + logging.info('initializing negatives') + self.initNegatives() + + def read_words(self): + """Read words(nodes) from walk files which are produced by sampler. 
+        """
+        word_freq = dict()
+        for walk_file in glob.glob(self.walk_files):
+            with open(walk_file, 'r') as reader:
+                for walk in reader:
+                    walk = walk.strip().split(' ')
+                    if len(walk) > 1:
+                        self.sentences_count += 1
+                        for word in walk:
+                            self.token_count += 1
+                            word_freq[word] = word_freq.get(word, 0) + 1
+
+        wid = 0
+        logging.info('Read %d sentences.' % self.sentences_count)
+        logging.info('Read %d words.' % self.token_count)
+        logging.info('%d words have been sampled.' % len(word_freq))
+        for w, c in word_freq.items():
+            if c < self.config['min_count']:
+                continue
+            self.word2id[w] = wid
+            self.id2word[wid] = w
+            self.word2freq[wid] = c
+            wid += 1
+
+        self.word_count = len(self.word2id)
+        logging.info(
+            '%d words that appeared fewer than %d (min_count) times have been discarded.' %
+            (len(word_freq) - len(self.word2id), self.config['min_count']))
+
+        pkl.dump(self.word2id, open(self.word2id_file, 'wb'))
+
+    def initDiscards(self):
+        """Get a frequency table for sub-sampling.
+        """
+        t = 0.0001
+        f = np.array(list(self.word2freq.values())) / self.token_count
+        self.discards = np.sqrt(t / f) + (t / f)
+
+    def initNegatives(self):
+        """Get a table for negative sampling.
+        """
+        pow_freq = np.array(list(self.word2freq.values()))**0.75
+        words_pow = sum(pow_freq)
+        ratio = pow_freq / words_pow
+        count = np.round(ratio * Dataset.NEGATIVE_TABLE_SIZE)
+        for wid, c in enumerate(count):
+            self.negatives += [wid] * int(c)
+        self.negatives = np.array(self.negatives)
+        np.random.shuffle(self.negatives)
+        self.sampling_prob = ratio
+
+    def getNegatives(self, size):
+        """Get negative samples from the negative sampling table.
+        """
+        return np.random.choice(self.negatives, size)
+
+    def walk_from_files(self, walkpath_files):
+        """Generate walks from files.
+        """
+        bucket = []
+        for filename in walkpath_files:
+            with open(filename) as reader:
+                for line in reader:
+                    words = line.strip().split(' ')
+                    if len(words) > 1:
+                        word_ids = [
+                            self.word2id[w] for w in words if w in self.word2id
+                        ]
+                        bucket.append(word_ids)
+                        if len(bucket) == self.config['batch_size']:
+                            yield bucket
+                            bucket = []
+        if len(bucket):
+            yield bucket
+
+    def pairs_generator(self, walkpath_files):
+        """Generate train pairs (src, pos, negs) for training the model.
+        """
+
+        def wrapper():
+            """wrapper for multiprocess calling.
+            """
+            for walks in self.walk_from_files(walkpath_files):
+                res = self.gen_pairs(walks)
+                yield res
+
+        return wrapper
+
+    def gen_pairs(self, walks):
+        """Generate train pairs data for training the model.
+        """
+        src = []
+        pos = []
+        negs = []
+        skip_window = self.config['win_size'] // 2
+        for walk in walks:
+            for i in range(len(walk)):
+                for j in range(1, skip_window + 1):
+                    if i - j >= 0:
+                        src.append(walk[i])
+                        pos.append(walk[i - j])
+                        negs.append(
+                            self.getNegatives(size=self.config['neg_num']))
+                    if i + j < len(walk):
+                        src.append(walk[i])
+                        pos.append(walk[i + j])
+                        negs.append(
+                            self.getNegatives(size=self.config['neg_num']))
+
+        src = np.array(src, dtype=np.int64).reshape(-1, 1, 1)
+        pos = np.array(pos, dtype=np.int64).reshape(-1, 1, 1)
+        negs = np.expand_dims(np.array(negs, dtype=np.int64), -1)
+        return {"src": src, "pos": pos, "negs": negs}
+
+
+if __name__ == "__main__":
+    config = {
+        'input_path': './data/out_aminer_CPAPC/',
+        'walk_path': 'aminer_walks_CPAPC_500num_100len/*',
+        'author_label_file': 'author_label.txt',
+        'venue_label_file': 'venue_label.txt',
+        'remapping_author_label_file': 'multi_class_author_label.txt',
+        'remapping_venue_label_file': 'multi_class_venue_label.txt',
+        'word2id_file': 'word2id.pkl',
+        'win_size': 7,
+        'neg_num': 5,
+        'min_count': 2,
+        'batch_size': 1,
+    }
+
+    log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
+    logging.basicConfig(level=getattr(logging, 'INFO'), format=log_format)
+
+    dataset = Dataset(config)
diff --git a/examples/metapath2vec/README.md b/examples/metapath2vec/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f7708391fa77ee0e736fc2b8deda1653eaef6ad4
--- /dev/null
+++ b/examples/metapath2vec/README.md
@@ -0,0 +1,47 @@
+# metapath2vec: Scalable Representation Learning for Heterogeneous Networks
+[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithmic framework for representation learning in heterogeneous networks that contain multiple types of nodes and links. Given a heterogeneous graph, the metapath2vec algorithm first generates meta-path-based random walks and then uses a skip-gram model to learn node embeddings from them. Based on PGL, we reproduce the metapath2vec algorithm.
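+
+To make the walk generation concrete, here is a small illustrative sketch of a meta-path-guided random walk on a toy graph (a hand-rolled example for this README only; the real sampling is done by `sample.py` below):
+
+```python
+import random
+
+# toy heterogeneous graph: node -> list of (neighbor, neighbor_type)
+neighbors = {
+    'c1': [('p1', 'paper'), ('p2', 'paper')],
+    'p1': [('c1', 'conf'), ('a1', 'author')],
+    'p2': [('c1', 'conf'), ('a1', 'author')],
+    'a1': [('p1', 'paper'), ('p2', 'paper')],
+}
+
+def metapath_walk(start, metapath, walk_length):
+    # 'conf-paper-author-paper-conf' starts and ends with the same type,
+    # so drop the last element and cycle through the remaining pattern
+    pattern = metapath.split('-')[:-1]
+    walk = [start]
+    for step in range(1, walk_length):
+        wanted = pattern[step % len(pattern)]
+        candidates = [n for n, t in neighbors[walk[-1]] if t == wanted]
+        if not candidates:
+            break
+        walk.append(random.choice(candidates))
+    return walk
+
+print(metapath_walk('c1', 'conf-paper-author-paper-conf', walk_length=5))
+```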
+```sh
+python main.py -c config.yaml
+
+python multi_class.py --dataset ./data/out_aminer_CPAPC/author_label.txt --word2id ./checkpoints/train.metapath2vec/word2id.pkl --ckpt_path ./checkpoints/train.metapath2vec/model_epoch5/
+
+```
+
+## Experiment results
+| train_percent | Metric   | PGL Result | Reported Result |
+|---------------|----------|------------|-----------------|
+| 50%           | macro-F1 | 0.9249     | 0.9314          |
+| 50%           | micro-F1 | 0.9283     | 0.9365          |
diff --git a/examples/metapath2vec/config.yaml b/examples/metapath2vec/config.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..aa91e72455e55e185f083a785335817dc26e66a6
--- /dev/null
+++ b/examples/metapath2vec/config.yaml
@@ -0,0 +1,49 @@
+task_name: train.metapath2vec
+use_cuda: True
+log_level: info
+seed: 1667
+
+sampler:
+    type:
+    args:
+        data_path: ./data/net_aminer/
+        author_label_file: ./data/label/googlescholar.8area.author.label.txt
+        venue_label_file: ./data/label/googlescholar.8area.venue.label.txt
+        output_path: ./data/out_aminer_CPAPC/
+        new_author_label_file: author_label.txt
+        new_venue_label_file: venue_label.txt
+        walk_saved_path: walks/
+        num_walks: 1000
+        walk_length: 100
+        metapath: conf-paper-author-paper-conf
+
+optimizer:
+    type: Adam
+    args:
+        lr: 0.005
+        end_lr: 0.0001
+
+trainer:
+    type: trainer
+    args:
+        epochs: 5
+        log_dir: logs/
+        save_dir: checkpoints/
+        output_dir: outputs/
+        num_sample_workers: 8
+
+data_loader:
+    type: Dataset
+    args:
+        input_path: ./data/out_aminer_CPAPC/  # same path as output_path in sampler
+        walk_path: walks/*
+        word2id_file: word2id.pkl
+        batch_size: 32
+        win_size: 7  # default: 7
+        neg_num: 5
+        min_count: 10
+
+model:
+    type: SkipgramModel
+    args:
+        embed_dim: 128
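The nested sections above mirror the components they configure (sampler, optimizer, trainer, data_loader, model). A minimal sketch of reading the file with PyYAML, which is what the `Config` helper in `utils.py` below does internally (assuming a PyYAML version that provides `yaml.FullLoader`):

```python
import yaml

# Load the nested configuration shared by sample.py and main.py.
with open("config.yaml") as f:
    config = yaml.load(f, Loader=yaml.FullLoader)

print(config["data_loader"]["args"]["batch_size"])  # 32
print(config["sampler"]["args"]["metapath"])        # conf-paper-author-paper-conf
```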
+""" +import os +import sys +import argparse +import time +import numpy as np +import logging +import pickle as pkl +import shutil +import glob + +import pgl +from pgl.utils import paddle_helper +import paddle +import paddle.fluid as fluid +import paddle.fluid.layers as fl + +from utils import * +import Dataset +import model as Models +from pgl.utils import mp_reader +from sklearn.metrics import (auc, f1_score, precision_recall_curve, + roc_auc_score) + + +def set_seed(seed): + """Set global random seed.""" + random.seed(seed) + np.random.seed(seed) + + +def save_param(dirname, var_name_list): + """save_param""" + if not os.path.exists(dirname): + os.makedirs(dirname) + + for var_name in var_name_list: + var = fluid.global_scope().find_var(var_name) + var_tensor = var.get_tensor() + np.save(os.path.join(dirname, var_name + '.npy'), np.array(var_tensor)) + + +def multiprocess_data_generator(config, dataset): + """Using multiprocess to generate training data. + """ + num_sample_workers = config['trainer']['args']['num_sample_workers'] + + walkpath_files = [[] for i in range(num_sample_workers)] + for idx, f in enumerate(glob.glob(dataset.walk_files)): + walkpath_files[idx % num_sample_workers].append(f) + + gen_data_pool = [ + dataset.pairs_generator(files) for files in walkpath_files + ] + if num_sample_workers == 1: + gen_data_func = gen_data_pool[0] + else: + gen_data_func = mp_reader.multiprocess_reader( + gen_data_pool, use_pipe=True, queue_size=100) + + return gen_data_func + + +def run_epoch(epoch, + config, + data_generator, + train_prog, + model, + feed_dict, + exe, + for_test=False): + """Run training process of every epoch. + """ + total_loss = [] + for idx, batch_data in enumerate(data_generator()): + feed_dict['train_inputs'] = batch_data['src'] + feed_dict['train_labels'] = batch_data['pos'] + feed_dict['train_negs'] = batch_data['negs'] + + loss, lr = exe.run(train_prog, + feed=feed_dict, + fetch_list=[model.loss, model.lr], + return_numpy=True) + total_loss.append(loss[0]) + + if (idx + 1) % 500 == 0: + avg_loss = np.mean(total_loss) + logging.info("epoch %d | step %d | lr %.4f | train_loss %f " % + (epoch, idx + 1, lr, avg_loss)) + total_loss = [] + + +def main(config): + """main function for training metapath2vec model. 
+    """
+    logging.info(config)
+
+    set_seed(config['seed'])
+
+    dataset = getattr(
+        Dataset, config['data_loader']['type'])(config['data_loader']['args'])
+    data_generator = multiprocess_data_generator(config, dataset)
+
+    # move word2id file to the checkpoints directory
+    src_word2id_file = dataset.word2id_file
+    dst_word2id_file = config['trainer']['args']['save_dir'] + config[
+        'data_loader']['args']['word2id_file']
+    logging.info('backup word2id file to %s' % dst_word2id_file)
+    shutil.move(src_word2id_file, dst_word2id_file)
+
+    place = fluid.CUDAPlace(0) if config['use_cuda'] else fluid.CPUPlace()
+    train_program = fluid.Program()
+    startup_program = fluid.Program()
+
+    with fluid.program_guard(train_program, startup_program):
+        model = getattr(Models, config['model']['type'])(
+            dataset=dataset, config=config['model']['args'], place=place)
+
+    with fluid.program_guard(train_program, startup_program):
+        global_steps = int(dataset.sentences_count *
+                           config['trainer']['args']['epochs'] /
+                           config['data_loader']['args']['batch_size'])
+        model.backward(global_steps, config['optimizer']['args'])
+
+    # train
+    exe = fluid.Executor(place)
+    exe.run(startup_program)
+    feed_dict = {}
+
+    logging.info('training...')
+    for epoch in range(1, 1 + config['trainer']['args']['epochs']):
+        run_epoch(epoch, config['trainer']['args'], data_generator,
+                  train_program, model, feed_dict, exe)
+
+        logging.info('saving model...')
+        cur_save_path = os.path.join(config['trainer']['args']['save_dir'],
+                                     "model_epoch%d" % (epoch))
+        save_param(cur_save_path, ['content'])
+
+    logging.info('finished training')
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='metapath2vec')
+    parser.add_argument(
+        '-c',
+        '--config',
+        default=None,
+        type=str,
+        help='config file path (default: None)')
+    parser.add_argument(
+        '-n',
+        '--taskname',
+        default=None,
+        type=str,
+        help='task name (default: None)')
+    args = parser.parse_args()
+
+    if args.config:
+        # load config file
+        config = Config(args.config, isCreate=True, isSave=True)
+        config = config()
+    else:
+        raise AssertionError(
+            "A configuration file needs to be specified. Add '-c config.yaml', for example."
+        )
+
+    log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
+    logging.basicConfig(
+        level=getattr(logging, config['log_level'].upper()), format=log_format)
+
+    main(config)
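The learning-rate schedule built by `model.backward()` is sized from the corpus: `main()` derives `global_steps` from the number of walk sentences. A quick arithmetic sketch with the config.yaml values (the walk count is illustrative):

```python
sentences_count = 1000000  # walks read by Dataset (illustrative value)
epochs = 5                 # trainer.args.epochs in config.yaml
batch_size = 32            # data_loader.args.batch_size in config.yaml

global_steps = int(sentences_count * epochs / batch_size)  # 156250
# fl.polynomial_decay then anneals the learning rate from lr=0.005
# down to end_lr=0.0001 over these steps.
```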
+""" + +import argparse +import time +import math +import os +import io +from multiprocessing import Pool +import logging +import numpy as np +import glob + +import pgl +from pgl import data_loader +from pgl.utils import op +from pgl.utils.logger import log +import paddle.fluid as fluid +import paddle.fluid.layers as fl + + +class SkipgramModel(object): + """Implemetation of skipgram model. + + Args: + config: dict, some configure parameters. + dataset: instance of Dataset class + place: GPU or CPU place + """ + + def __init__(self, config, dataset, place): + self.config = config + self.dataset = dataset + self.place = place + self.neg_num = self.dataset.config['neg_num'] + + self.num_nodes = len(dataset.word2id) + + self.train_inputs = fl.data( + 'train_inputs', shape=[None, 1, 1], dtype='int64') + + self.train_labels = fl.data( + 'train_labels', shape=[None, 1, 1], dtype='int64') + + self.train_negs = fl.data( + 'train_negs', shape=[None, self.neg_num, 1], dtype='int64') + + self.forward() + + def backward(self, global_steps, opt_config): + """Build the optimizer. + """ + self.lr = fl.polynomial_decay(opt_config['lr'], global_steps, + opt_config['end_lr']) + adam = fluid.optimizer.Adam(learning_rate=self.lr) + adam.minimize(self.loss) + + def forward(self): + """Build the skipgram model. + """ + initrange = 1.0 / self.config['embed_dim'] + embed_init = fluid.initializer.UniformInitializer( + low=-initrange, high=initrange) + weight_init = fluid.initializer.TruncatedNormal( + scale=1.0 / math.sqrt(self.config['embed_dim'])) + + embed_src = fl.embedding( + input=self.train_inputs, + size=[self.num_nodes, self.config['embed_dim']], + param_attr=fluid.ParamAttr( + name='content', initializer=embed_init)) + + weight_pos = fl.embedding( + input=self.train_labels, + size=[self.num_nodes, self.config['embed_dim']], + param_attr=fluid.ParamAttr( + name='weight', initializer=weight_init)) + + weight_negs = fl.embedding( + input=self.train_negs, + size=[self.num_nodes, self.config['embed_dim']], + param_attr=fluid.ParamAttr( + name='weight', initializer=weight_init)) + + pos_logits = fl.matmul( + embed_src, weight_pos, transpose_y=True) # [batch_size, 1, 1] + + pos_score = fl.squeeze(pos_logits, axes=[1]) + pos_score = fl.clip(pos_score, min=-10, max=10) + pos_score = -1.0 * fl.logsigmoid(pos_score) + + neg_logits = fl.matmul( + embed_src, weight_negs, + transpose_y=True) # [batch_size, 1, neg_num] + neg_score = fl.squeeze(neg_logits, axes=[1]) + neg_score = fl.clip(neg_score, min=-10, max=10) + neg_score = -1.0 * fl.logsigmoid(-1.0 * neg_score) + neg_score = fl.reduce_sum(neg_score, dim=1, keep_dim=True) + + self.loss = fl.reduce_mean(pos_score + neg_score) diff --git a/examples/metapath2vec/multi_class.py b/examples/metapath2vec/multi_class.py new file mode 100644 index 0000000000000000000000000000000000000000..bb73410ac61adfe891f8bc07cbce57d4bf0b649e --- /dev/null +++ b/examples/metapath2vec/multi_class.py @@ -0,0 +1,241 @@ +# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
diff --git a/examples/metapath2vec/multi_class.py b/examples/metapath2vec/multi_class.py
new file mode 100644
index 0000000000000000000000000000000000000000..bb73410ac61adfe891f8bc07cbce57d4bf0b649e
--- /dev/null
+++ b/examples/metapath2vec/multi_class.py
@@ -0,0 +1,241 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+This file provides the multi-class task for testing the embeddings learned by the metapath2vec model.
+"""
+import argparse
+import sys
+import os
+import tqdm
+import time
+import math
+import logging
+import random
+import pickle as pkl
+
+import numpy as np
+import sklearn.metrics
+from sklearn.metrics import f1_score
+
+import pgl
+import paddle.fluid as fluid
+import paddle.fluid.layers as fl
+
+import Dataset
+from utils import *
+
+
+def load_param(dirname, var_name_list):
+    """load_param"""
+    for var_name in var_name_list:
+        var = fluid.global_scope().find_var(var_name)
+        var_tensor = var.get_tensor()
+        var_tmp = np.load(os.path.join(dirname, var_name + '.npy'))
+        var_tensor.set(var_tmp, fluid.CPUPlace())
+
+
+def load_data(file_, word2id):
+    """Load data for node classification.
+    """
+    words_label = []
+    line_count = 0
+    with open(file_, 'r') as reader:
+        for line in reader:
+            line_count += 1
+            tokens = line.strip().split(' ')
+            word, label = tokens[0], int(tokens[1]) - 1
+            if word in word2id:
+                words_label.append((word2id[word], label))
+
+    words_label = np.array(words_label, dtype=np.int64)
+    np.random.shuffle(words_label)
+
+    logging.info('%d/%d word_label pairs have been loaded' %
+                 (len(words_label), line_count))
+    return words_label
+
+
+def node_classify_model(word2id, num_labels, embed_dim=16):
+    """Build the node classification model.
+
+    Args:
+        word2id(dict): map a word (node) to its corresponding index
+
+        num_labels: The number of labels.
+
+        embed_dim: The dimension of embedding.
+    """
+
+    nodes = fl.data('nodes', shape=[None, 1], dtype='int64')
+    labels = fl.data('labels', shape=[None, 1], dtype='int64')
+
+    embed_nodes = fl.embedding(
+        input=nodes,
+        size=[len(word2id), embed_dim],
+        param_attr=fluid.ParamAttr(name='content'))
+
+    # Freeze the pretrained embeddings; only the classifier is trained.
+    embed_nodes.stop_gradient = True
+    probs = fl.fc(input=embed_nodes, size=num_labels, act='softmax')
+    predict = fl.argmax(probs, axis=-1)
+    loss = fl.cross_entropy(input=probs, label=labels)
+    loss = fl.reduce_mean(loss)
+
+    return {
+        'loss': loss,
+        'probs': probs,
+        'predict': predict,
+        'labels': labels,
+    }
+
+
+def run_epoch(exe, prog, model, feed_dict, lr):
+    """Run the training process for one epoch.
+    """
+    if lr is None:
+        loss, predict = exe.run(prog,
+                                feed=feed_dict,
+                                fetch_list=[model['loss'], model['predict']],
+                                return_numpy=True)
+        lr_ = 0
+    else:
+        loss, predict, lr_ = exe.run(
+            prog,
+            feed=feed_dict,
+            fetch_list=[model['loss'], model['predict'], lr],
+            return_numpy=True)
+
+    macro_f1 = f1_score(feed_dict['labels'], predict, average="macro")
+    micro_f1 = f1_score(feed_dict['labels'], predict, average="micro")
+
+    return {
+        'loss': loss,
+        'pred': predict,
+        'lr': lr_,
+        'macro_f1': macro_f1,
+        'micro_f1': micro_f1
+    }
+
+
+def main(args):
+    """Main function for the node classification task.
+    """
+    word2id = pkl.load(open(args.word2id, 'rb'))
+    words_label = load_data(args.dataset, word2id)
+    # split data for training and testing
+    split_position = int(words_label.shape[0] * args.train_percent)
+    train_words_label = words_label[0:split_position, :]
+    test_words_label = words_label[split_position:, :]
+
+    place = fluid.CUDAPlace(0) if args.use_cuda else fluid.CPUPlace()
+    train_prog = fluid.Program()
+    test_prog = fluid.Program()
+    startup_prog = fluid.Program()
+
+    with fluid.program_guard(train_prog, startup_prog):
+        with fluid.unique_name.guard():
+            model = node_classify_model(
+                word2id, args.num_labels, embed_dim=args.embed_dim)
+
+    test_prog = train_prog.clone(for_test=True)
+
+    with fluid.program_guard(train_prog, startup_prog):
+        lr = fl.polynomial_decay(args.lr, 1000, 0.001)
+        adam = fluid.optimizer.Adam(lr)
+        adam.minimize(model['loss'])
+
+    exe = fluid.Executor(place)
+    exe.run(startup_prog)
+
+    load_param(args.ckpt_path, ['content'])
+
+    feed_dict = {}
+    X = train_words_label[:, 0].reshape(-1, 1)
+    labels = train_words_label[:, 1].reshape(-1, 1)
+    logging.info('%d/%d data to train' %
+                 (labels.shape[0], words_label.shape[0]))
+
+    test_feed_dict = {}
+    test_X = test_words_label[:, 0].reshape(-1, 1)
+    test_labels = test_words_label[:, 1].reshape(-1, 1)
+    logging.info('%d/%d data to test' %
+                 (test_labels.shape[0], words_label.shape[0]))
+
+    for epoch in range(args.epochs):
+        feed_dict['nodes'] = X
+        feed_dict['labels'] = labels
+        train_result = run_epoch(exe, train_prog, model, feed_dict, lr)
+
+        test_feed_dict['nodes'] = test_X
+        test_feed_dict['labels'] = test_labels
+
+        test_result = run_epoch(exe, test_prog, model, test_feed_dict, lr=None)
+
+        logging.info(
+            'epoch %d | lr %.4f | train_loss %.5f | train_macro_F1 %.4f | train_micro_F1 %.4f | test_loss %.5f | test_macro_F1 %.4f | test_micro_F1 %.4f'
+            % (epoch, train_result['lr'], train_result['loss'],
+               train_result['macro_f1'], train_result['micro_f1'],
+               test_result['loss'], test_result['macro_f1'],
+               test_result['micro_f1']))
+
+    logging.info(
+        'final_test_macro_f1 score: %.4f | final_test_micro_f1 score: %.4f' %
+        (test_result['macro_f1'], test_result['micro_f1']))
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='multi_class')
+    parser.add_argument(
+        '--dataset',
+        default=None,
+        type=str,
+        help='training and testing data file (default: None)')
+    parser.add_argument(
+        '--word2id',
+        default=None,
+        type=str,
+        help='word2id file (default: None)')
+    parser.add_argument(
+        '--ckpt_path',
+        default=None,
+        type=str,
+        help='checkpoint path (default: None)')
+    parser.add_argument("--use_cuda", action='store_true', help="use_cuda")
+    parser.add_argument(
+        '--train_percent',
+        default=0.5,
+        type=float,
+        help='train_percent (default: 0.5)')
+    parser.add_argument(
+        '--num_labels',
+        default=8,
+        type=int,
+        help='number of labels (default: 8)')
+    parser.add_argument(
+        '--epochs',
+        default=100,
+        type=int,
+        help='number of epochs for training (default: 100)')
+    parser.add_argument(
+        '--lr',
+        default=0.025,
+        type=float,
+        help='learning rate (default: 0.025)')
+    parser.add_argument(
+        '--embed_dim',
+        default=128,
+        type=int,
+        help='dimension of embedding (default: 128)')
+    args = parser.parse_args()
+
+    log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
+    logging.basicConfig(level='INFO', format=log_format)
+
+    main(args)
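For reference, here is the evaluation command from the README above with the argparse defaults spelled out explicitly (same paths; the extra flags merely restate the defaults):

```sh
python multi_class.py --dataset ./data/out_aminer_CPAPC/author_label.txt \
    --word2id ./checkpoints/train.metapath2vec/word2id.pkl \
    --ckpt_path ./checkpoints/train.metapath2vec/model_epoch5/ \
    --train_percent 0.5 --num_labels 8 --epochs 100 --lr 0.025
```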
diff --git a/examples/metapath2vec/sample.py b/examples/metapath2vec/sample.py
new file mode 100644
index 0000000000000000000000000000000000000000..cbd7505d5923f807e7a8e7f6f8e3dc630f4d02c1
--- /dev/null
+++ b/examples/metapath2vec/sample.py
@@ -0,0 +1,248 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+This file implements the sampler that generates metapath random-walk sequences
+for training the metapath2vec model.
+"""
+
+import multiprocessing
+from multiprocessing import Pool
+import argparse
+import sys
+import os
+import numpy as np
+import pickle as pkl
+import tqdm
+import time
+import logging
+import random
+from pgl.contrib import heter_graph
+from pgl.sample import metapath_randomwalk
+from utils import *
+
+
+class Sampler(object):
+    """Implementation of the sampler used to generate metapath random walks.
+
+    Args:
+        config: dict, some configure parameters.
+    """
+
+    def __init__(self, config):
+        self.config = config
+        self.build_graph()
+
+    def build_graph(self):
+        """Build the pgl heterogeneous graph.
+        """
+        self.conf_id2index, self.conf_name2index, conf_node_type = self.remapping_id(
+            self.config['data_path'] + 'id_conf.txt',
+            start_index=0,
+            node_type='conf')
+        logging.info('%d venues have been loaded.' % (len(self.conf_id2index)))
+
+        self.author_id2index, self.author_name2index, author_node_type = self.remapping_id(
+            self.config['data_path'] + 'id_author.txt',
+            start_index=len(self.conf_id2index),
+            node_type='author')
+        logging.info('%d authors have been loaded.' %
+                     (len(self.author_id2index)))
+
+        self.paper_id2index, self.paper_name2index, paper_node_type = self.remapping_id(
+            self.config['data_path'] + 'paper.txt',
+            start_index=(len(self.conf_id2index) + len(self.author_id2index)),
+            node_type='paper',
+            separator='\t')
+        logging.info('%d papers have been loaded.' %
+                     (len(self.paper_id2index)))
+
+        node_types = conf_node_type + author_node_type + paper_node_type
+        num_nodes = len(node_types)
+        edges_by_types = {}
+        paper_author_edges = self.load_edges(
+            self.config['data_path'] + 'paper_author.txt', self.paper_id2index,
+            self.author_id2index)
+        paper_conf_edges = self.load_edges(
+            self.config['data_path'] + 'paper_conf.txt', self.paper_id2index,
+            self.conf_id2index)
+
+        edges_by_types['edge'] = paper_author_edges + paper_conf_edges
+        logging.info('%d edges have been loaded.' %
+                     (len(edges_by_types['edge'])))
+
+        node_features = {
+            'index': np.array([i for i in range(num_nodes)]).reshape(
+                -1, 1).astype(np.int64)
+        }
+
+        self.graph = heter_graph.HeterGraph(
+            num_nodes=num_nodes,
+            edges=edges_by_types,
+            node_types=node_types,
+            node_feat=node_features)
+
+    def remapping_id(self, file_, start_index, node_type, separator='\t'):
+        """Map the IDs and names of nodes to indices.
+        """
+        node_types = []
+        id2index = {}
+        name2index = {}
+        index = start_index
+        with open(file_, encoding="ISO-8859-1") as reader:
+            for line in reader:
+                tokens = line.strip().split(separator)
+                id2index[tokens[0]] = index
+                if len(tokens) == 2:
+                    name2index[tokens[1]] = index
+                node_types.append((index, node_type))
+                index += 1
+
+        return id2index, name2index, node_types
+
+    def load_edges(self, file_, src2index, dst2index, symmetry=True):
+        """Load edges from file.
+        """
+        edges = []
+        with open(file_, 'r') as reader:
+            for line in reader:
+                items = line.strip().split()
+                src, dst = src2index[items[0]], dst2index[items[1]]
+                edges.append((src, dst))
+                if symmetry:
+                    edges.append((dst, src))
+            edges = list(set(edges))
+        return edges
+
+    def generate_multi_class_data(self, name_label_file):
+        """Map the data that will be used in the multi-class task to indices.
+        """
+        if 'author' in name_label_file:
+            name2index = self.author_name2index
+        else:
+            name2index = self.conf_name2index
+
+        index_label_list = []
+        with open(name_label_file, encoding="ISO-8859-1") as reader:
+            for line in reader:
+                tokens = line.strip().split(' ')
+                name, label = tokens[0], int(tokens[1])
+                index = name2index[name]
+                index_label_list.append((index, label))
+
+        return index_label_list
+
+
+def generate_walks(args):
+    """Generate metapath random walks and save them to file.
+    """
+    g, meta_path, filename, walk_length = args
+    node_types = g._node_types
+    first_type = meta_path.split('-')[0]
+    nodes = np.where(node_types == first_type)[0]
+    # Cap the number of start nodes per file to bound the walk-file size.
+    if len(nodes) > 4000:
+        nodes = np.random.choice(nodes, 4000, replace=False)
+
+    logging.info('%d number of start nodes' % (len(nodes)))
+    logging.info('save walks in file: %s' % (filename))
+
+    with open(filename, 'w') as writer:
+        for start_node in nodes:
+            walk = metapath_randomwalk(g, start_node, meta_path, walk_length)
+            walk = [str(walk[i]) for i in range(0, len(walk), 2)]  # skip paper
+            writer.write(' '.join(walk) + '\n')
+
+
+def multiprocess_generate_walks(sampler, edge_type, meta_path, num_walks,
+                                walk_length, saved_path):
+    """Use multiprocessing to generate metapath random walks.
+    """
+    args = []
+    for i in range(num_walks):
+        filename = saved_path + '%04d' % (i)
+        args.append(
+            (sampler.graph[edge_type], meta_path, filename, walk_length))
+
+    pool = Pool(16)
+    pool.map(generate_walks, args)
+    pool.close()
+    pool.join()
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description='metapath2vec')
+    parser.add_argument(
+        '-c',
+        '--config',
+        default=None,
+        type=str,
+        help='config file path (default: None)')
+    args = parser.parse_args()
+
+    if args.config:
+        # load config file
+        config = Config(args.config, isCreate=False, isSave=False)
+        config = config()
+        config = config['sampler']['args']
+    else:
+        raise AssertionError(
+            "A configuration file needs to be specified. Add '-c config.yaml', for example."
+        )
+
+    log_format = '%(asctime)s-%(levelname)s-%(name)s: %(message)s'
+    logging.basicConfig(level="INFO", format=log_format)
+
+    logging.info(config)
+
+    if not os.path.exists(config['output_path']):
+        os.makedirs(config['output_path'])
+
+    config['walk_saved_path'] = config['output_path'] + config[
+        'walk_saved_path']
+    if not os.path.exists(config['walk_saved_path']):
+        os.makedirs(config['walk_saved_path'])
+
+    sampler = Sampler(config)
+
+    begin = time.time()
+    logging.info('multi process sampling')
+    multiprocess_generate_walks(
+        sampler=sampler,
+        edge_type='edge',
+        meta_path=config['metapath'],
+        num_walks=config['num_walks'],
+        walk_length=config['walk_length'],
+        saved_path=config['walk_saved_path'])
+    logging.info('total time: %.4f' % (time.time() - begin))
+
+    logging.info('generating multi class data')
+    word_label_list = sampler.generate_multi_class_data(config[
+        'author_label_file'])
+    with open(config['output_path'] + config['new_author_label_file'],
+              'w') as writer:
+        for line in word_label_list:
+            line = [str(i) for i in line]
+            writer.write(' '.join(line) + '\n')
+
+    word_label_list = sampler.generate_multi_class_data(config[
+        'venue_label_file'])
+    with open(config['output_path'] + config['new_venue_label_file'],
+              'w') as writer:
+        for line in word_label_list:
+            line = [str(i) for i in line]
+            writer.write(' '.join(line) + '\n')
+    logging.info('finished')
diff --git a/examples/metapath2vec/utils.py b/examples/metapath2vec/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..c83a904177819768f4c11c32f2b8515c02f3cefc
--- /dev/null
+++ b/examples/metapath2vec/utils.py
@@ -0,0 +1,94 @@
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+"""
+This file implements a class for model configuration.
+"""
+
+import datetime
+import os
+import yaml
+import random
+import shutil
+
+
+class Config(object):
+    """Implementation of the Config class for model configuration.
+
+    Args:
+        config_file(str): configuration file name, which is a yaml file.
+        isCreate(bool): if true, create the necessary directories to save models, log files and other outputs.
+        isSave(bool): if true, save config_file in order to record the configuration.
+    """
+
+    def __init__(self, config_file, isCreate=False, isSave=False):
+        self.config_file = config_file
+        self.config = self.get_config_from_yaml(config_file)
+
+        if isCreate:
+            self.create_necessary_dirs()
+
+        if isSave:
+            self.save_config_file()
+
+    def get_config_from_yaml(self, yaml_file):
+        """Get the configuration hyperparameters from a yaml file.
+        """
+        try:
+            with open(yaml_file, 'r') as f:
+                config = yaml.load(f, Loader=yaml.FullLoader)
+        except Exception:
+            raise IOError("Error in parsing config file '%s'" % yaml_file)
+
+        return config
+
+    def create_necessary_dirs(self):
+        """Create the necessary directories to save some important files.
+        """
+
+        time_stamp = datetime.datetime.now().strftime('%m%d_%H%M')
+        self.config['trainer']['args']['log_dir'] = ''.join(
+            (self.config['trainer']['args']['log_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        self.config['trainer']['args']['save_dir'] = ''.join(
+            (self.config['trainer']['args']['save_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        self.config['trainer']['args']['output_dir'] = ''.join(
+            (self.config['trainer']['args']['output_dir'],
+             self.config['task_name'], '/'))  # , '.%s/' % (time_stamp)))
+        # if os.path.exists(self.config['trainer']['args']['save_dir']):
+        #     input('save_dir is existed, do you really want to continue?')
+
+        self.make_dir(self.config['trainer']['args']['log_dir'])
+        self.make_dir(self.config['trainer']['args']['save_dir'])
+        self.make_dir(self.config['trainer']['args']['output_dir'])
+
+    def save_config_file(self):
+        """Save the config file so that the configuration can be looked up later.
+        """
+        filename = self.config_file.split('/')[-1]
+        targetpath = self.config['trainer']['args']['save_dir']
+        shutil.copyfile(self.config_file, targetpath + filename)
+
+    def make_dir(self, path):
+        """Build directory"""
+        if not os.path.exists(path):
+            os.makedirs(path)
+
+    def __getitem__(self, key):
+        """Return the configure dict"""
+        return self.config[key]
+
+    def __call__(self):
+        """__call__"""
+        return self.config
diff --git a/examples/node2vec/README.md b/examples/node2vec/README.md
index 4339c40b70d822bbd1d1b06aae6431e07d888377..db315a938796cb840292143aaf1323d9d2eea561 100644
--- a/examples/node2vec/README.md
+++ b/examples/node2vec/README.md
@@ -1,4 +1,4 @@
-# PGL Examples for node2vec
+# node2vec: Scalable Feature Learning for Networks
 [Node2vec](https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce node2vec algorithms and reach the same level of indicators as the paper.
 ## Datasets
 The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) and [Arxiv](http://snap.stanford.edu/data/ca-AstroPh.html).
@@ -16,7 +16,7 @@ python node2vec.py --use_cuda --dataset BlogCatalog --save_path ./tmp/node2vec_B
 python multi_class.py --use_cuda --ckpt_path ./tmp/node2vec_BlogCatalog/paddle_model --epoch 1000
 
 # link prediction task example
-python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 400
+python node2vec.py --use_cuda --dataset ArXiv --save_path ./tmp/node2vec_ArXiv --offline_learning --epoch 10
 python link_predict.py --use_cuda --ckpt_path ./tmp/node2vec_ArXiv/paddle_model --epoch 400
 ```
diff --git a/examples/sgc/README.md b/examples/sgc/README.md
index 92fc546d51a9b07a45ff3d44fc000fa44a36b6b0..f45acd812e30b4defffe5a771759f037d6b2a0ba 100644
--- a/examples/sgc/README.md
+++ b/examples/sgc/README.md
@@ -1,4 +1,4 @@
-# PGL Examples for SGC
+# SGC: Simplifying Graph Convolutional Networks
 [Simplifying Graph Convolutional Networks \(SGC\)](https://arxiv.org/pdf/1902.07153.pdf) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce SGC algorithms and reach the same level of indicators as the paper in citation network benchmarks.
diff --git a/examples/strucvec/README.md b/examples/strucvec/README.md
index b23c10a297014f6822e0ce1e4c7b3b1ac4559cd9..3a8b5b13a387231a7172d03a757ba3e5b13d4131 100644
--- a/examples/strucvec/README.md
+++ b/examples/strucvec/README.md
@@ -1,13 +1,13 @@
-## PGL Examples For Struc2Vec
+# struc2vec: Learning Node Representations from Structural Identity
 [Struc2vec](https://arxiv.org/abs/1704.03165) builds on the concept of structural identity, in which network nodes are identified according to the network structure and their relationship to other nodes. A novel and flexible framework for learning latent representations is proposed in the struc2vec paper. We reproduce the struc2vec algorithm in PGL.
-## DataSet
+### DataSet
 The struc2vec paper uses an air-traffic network to validate the algorithm. Each edge in the dataset indicates a flight between two airports, and the connections between airports are used to predict their level of activity. The following dataset is used to validate the accuracy of the algorithm. The data was collected from the Bureau of Transportation Statistics from January to October, 2016. The network has 1,190 nodes and 13,599 edges (the diameter is 8). [Link](https://www.transtats.bts.gov/)
 - usa-airports.edgelist
 - labels-usa-airports.txt
-## Dependencies
+### Dependencies
 If you want to use the struc2vec model in PGL, please additionally install gensim, pathos and fastdtw.
 - paddlepaddle>=1.6
 - pgl
@@ -15,11 +15,11 @@ If use want to use the struc2vec model in pgl, please install the gensim, pathos
 - pathos
 - fastdtw
-## How to use
+### How to use
 For example, to train and validate the struc2vec model on the American airport dataset:
 > python struc2vec.py --edge_file data/usa-airports.edgelist --label_file data/labels-usa-airports.txt --train True --valid True --opt2 True
-## Hyperparameters
+### Hyperparameters
 | Args| Meaning|
 | ------------- | ------------- |
 | edge_file | input file name for edges|
@@ -35,7 +35,7 @@ For examples, we want to train and valid the Struc2vec model on American airpot
 | valid| The flag to use the w2v embedding to validate the classification result|
 | num_class| The number of classes in the classification model to be trained|
-## Experiment results
+### Experiment results
 | Dataset | Model | Metric | PGL Result | Paper repo Result |
 | ------------- | ------------- |------------- |------------- |------------- |
 | American airport dataset | Struc2vec without time cost optimization| ACC |0.6483|0.6340|
diff --git a/examples/strucvec/classify.py b/examples/strucvec/classify.py
index daaa87eb8dee6e117ae1c523c089ddcc51af206f..d36bf6a59415c134ca0d9aa5e3e3c64aa502aedc 100644
--- a/examples/strucvec/classify.py
+++ b/examples/strucvec/classify.py
@@ -1,4 +1,7 @@
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+"""
+classify.py
+"""
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -11,7 +14,6 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-
 import numpy as np
 import paddle
 import paddle.fluid as fluid
diff --git a/examples/strucvec/sklearn_classify.py b/examples/strucvec/sklearn_classify.py
index b2dde24b9af2f5d9e8455a4efb114ea7e51b3d14..07e5c875a225fca6d409611c8d0941a0129c210d 100644
--- a/examples/strucvec/sklearn_classify.py
+++ b/examples/strucvec/sklearn_classify.py
@@ -1,4 +1,7 @@
-# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved.
+"""
+sklearn_classify.py
+"""
+# Copyright (c) 2019 PaddlePaddle Authors. All Rights Reserved
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -36,7 +39,7 @@ def train_lr_l2_model(args, data):
         test_size=0.2,
         random_state=random_num + random_seed)
 
-    # use the one vs rest to train the lr model with l2
+    # use one-vs-rest to train the lr model with l2 regularization
     pred_test = []
     for i in range(0, args.num_class):
         y_train_relabel = np.where(y_train == i, 1, 0)
diff --git a/examples/unsup_graphsage/README.md b/examples/unsup_graphsage/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..a70fc276f30c875cca32e1014501dd60d889cfdb
--- /dev/null
+++ b/examples/unsup_graphsage/README.md
@@ -0,0 +1,32 @@
+# Unsupervised GraphSAGE in PGL
+[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
+information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and reach the same level of indicators as the paper on the Reddit dataset. Besides, this is an example of subgraph sampling and training in PGL.
+For unsupervised learning, we use graph edges as positive samples for GraphSAGE training.
+### Datasets (Quickstart)
+The dataset `./sample.txt` is a handcrafted bipartite graph for quick demo purposes, whose format is `src \t dst`.
+### Dependencies
+```txt
+- paddlepaddle>=1.6
+- pgl
+```
+### How to run
+#### 1. Training
+```sh
+python train.py --data_path ./sample.txt --num_nodes 2000 --phase train
+```
+#### 2. Predicting
+```sh
+python train.py --data_path ./sample.txt --num_nodes 2000 --phase predict
+```
+The resulting node embeddings are stored in the `emb.npy` file, which can later be loaded using `np.load`.
+#### Hyperparameters
+- epoch: Number of epochs. (default: 1)
+- use_cuda: Use GPU if assigned.
+- layer_type: We support 4 aggregator types, including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
+- sample_workers: The number of workers for multiprocessing subgraph sampling.
+- lr: Learning rate.
+- batch_size: Batch size.
+- samples: The max number of neighbors sampled for each hop. (default: [10, 10])
+- num_layers: The number of layers for graph sampling. (default: 2)
+- hidden_size: The hidden size of the GraphSAGE models.
+- checkpoint: Path for the model checkpoint at each epoch. (default: 'model_ckpt')
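Assuming the hyperparameters listed above are exposed as command-line flags of `train.py` (the flag spellings are inferred from the list, so treat this as illustrative), a fuller training invocation might look like:

```sh
# Illustrative only: the quick-start command plus a few of the
# hyperparameters listed above.
python train.py --data_path ./sample.txt --num_nodes 2000 --phase train \
    --use_cuda --layer_type graphsage_mean --epoch 10 --lr 0.01 --batch_size 512
```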
diff --git a/examples/unsup_graphsage/train.py b/examples/unsup_graphsage/train.py
index b5655d61ea32b60f4a8abfa5f283595c8a236ac6..a53ffdcb042e74ed262b10a44e0b5c36fad3b1ef 100644
--- a/examples/unsup_graphsage/train.py
+++ b/examples/unsup_graphsage/train.py
@@ -318,7 +318,7 @@ def main(args):
                 model_dict=ret_dict,
                 epoch=epoch,
                 batch_size=args.batch_size,
-                log_per_step=10)
+                log_per_step=1)
             epoch_end = time.time()
             print("Epoch: {0}, Train total expend: {1} ".format(
                 epoch, epoch_end - epoch_start))
@@ -326,7 +326,7 @@ def main(args):
             log.info("Run Epoch Error %s" % e)
         fluid.io.save_params(
             exe,
-            dirname=args.checkpoint + '_%s' % epoch,
+            dirname=args.checkpoint + '_%s' % (epoch + 1),
             main_program=train_program)
         log.info("EPOCH END")
 
@@ -343,7 +343,7 @@ def main(args):
             args.num_layers,
             ret_dict.graph_wrappers,
             batch_size=args.batch_size,
-            data=(test_src, test_src, test_src, test_src),
+            data=(test_src, test_src, test_src),
             samples=args.samples,
             num_workers=args.sample_workers,
             feed_name_list=feed_name_list,
diff --git a/pgl/__init__.py b/pgl/__init__.py
index 15c23cc633540a53b2b445a3012f2669319ba5d4..c7f323d9f9020b657b7d1b1c2c1251107448fff8 100644
--- a/pgl/__init__.py
+++ b/pgl/__init__.py
@@ -13,7 +13,7 @@
 # limitations under the License.
 """Generate pgl apis
 """
-__version__ = "1.0.0"
+__version__ = "1.0.1"
 from pgl import layers
 from pgl import graph_wrapper
 from pgl import graph
diff --git a/pgl/contrib/heter_graph.py b/pgl/contrib/heter_graph.py
index b82bf46b729ad6f0df56b25db4053b09bf9ef361..117baed86bd82caaf9a278e1b6c987e9064f8ec9 100644
--- a/pgl/contrib/heter_graph.py
+++ b/pgl/contrib/heter_graph.py
@@ -14,11 +14,11 @@
 """
 This package implement Heterogeneous Graph structure for handling Heterogeneous graph data.
 """
 import numpy as np
 import pickle as pkl
 import time
 import pgl.graph_kernel as graph_kernel
-from pgl import graph
+from pgl.graph import Graph
 
 __all__ = ['HeterGraph']
@@ -31,123 +32,111 @@ def _hide_num_nodes(shape):
     return shape
 
 
-class HeterGraph(object):
-    """Implementation of graph structure in pgl
-
-    This is a simple implementation of heterogeneous graph structure in pgl
+class NodeGraph(Graph):
+    """Implementation of a graph that has multiple node types.
 
     Args:
-        num_nodes_every_type: dict, number of nodes for every node type
+        num_nodes: number of nodes in the graph
+        edges: list of (u, v) tuples
+        node_types (optional): list of (u, node_type) tuples to specify the node type of every node
+        node_feat (optional): a dict of numpy arrays as node features
+        edge_feat (optional): a dict of numpy arrays as edge features
+    """
+
+    def __init__(self,
+                 num_nodes,
+                 edges,
+                 node_types=None,
+                 node_feat=None,
+                 edge_feat=None):
+        super(NodeGraph, self).__init__(num_nodes, edges, node_feat, edge_feat)
+
+        if isinstance(node_types, list):
+            self._node_types = np.array(node_types, dtype=object)[:, 1]
+        else:
+            self._node_types = node_types
+
+
+class HeterGraph(object):
+    """Implementation of the heterogeneous graph structure in pgl
 
-    edges_every_type: dict, every element is a list of (u, v) tuples.
+    This is a simple implementation of the heterogeneous graph structure in pgl.
 
-    node_feat_every_type: features for every node type.
+    Args:
+        num_nodes: number of nodes in a heterogeneous graph
+        edges: dict, every element in the dict is a list of (u, v) tuples.
+ node_types (optional): list of (u, node_type) tuples to specify the node type of every node + node_feat (optional): a dict of numpy array as node features + edge_feat (optional): a dict of dict as edge features for every edge type Examples: .. code-block:: python import numpy as np - num_nodes_every_type = {'type1':3,'type2':4, 'type3':2} - edges_every_type = { - ('type1','type2', 'edges_type1'): [(0,1), (1,2)], - ('type1', 'type3', 'edges_type2'): [(1,2), (3,1)], - } - node_feat_every_type = { - 'type1': {'features1': np.random.randn(3, 4), - 'features2': np.random.randn(3, 4)}, - 'type2': {'features3': np.random.randn(4, 4)}, - 'type3': {'features1': np.random.randn(2, 4), - 'features2': np.random.randn(2, 4)} + num_nodes = 4 + node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')] + edges = { + 'edges_type1': [(0,1), (3,2)], + 'edges_type2': [(1,2), (3,1)], } - edges_feat_every_type = { - ('type1','type2','edges_type1'): {'h': np.random.randn(2, 4)}, - ('type1', 'type3', 'edges_type2'): {'h':np.random.randn(2, 4)}, + node_feat = {'feature': np.random.randn(4, 16)} + edges_feat = { + 'edges_type1': {'h': np.random.randn(2, 16)}, + 'edges_type2': {'h': np.random.randn(2, 16)}, } g = heter_graph.HeterGraph( - num_nodes_every_type=num_nodes_every_type, - edges_every_type=edges_every_type, - node_feat_every_type=node_feat_every_type, - edge_feat_every_type=edges_feat_every_type) - + num_nodes=num_nodes, + edges=edges, + node_types=node_types, + node_feat=node_feat, + edge_feat=edges_feat) """ def __init__(self, - num_nodes_every_type, - edges_every_type, - node_feat_every_type=None, - edge_feat_every_type=None): - - self._num_nodes_dict = num_nodes_every_type - self._edges_dict = edges_every_type - if node_feat_every_type is not None: - self._node_feat = node_feat_every_type + num_nodes, + edges, + node_types=None, + node_feat=None, + edge_feat=None): + self._num_nodes = num_nodes + self._edges_dict = edges + + if node_feat is not None: + self._node_feat = node_feat else: self._node_feat = {} - if edge_feat_every_type is not None: - self._edge_feat = edge_feat_every_type + if edge_feat is not None: + self._edge_feat = edge_feat else: self._edge_feat = {} self._multi_graph = {} for key, value in self._edges_dict.items(): - if not self._node_feat: - node_feat = None - else: - node_feat = self._node_feat[key[0]] - if not self._edge_feat: edge_feat = None else: edge_feat = self._edge_feat[key] - self._multi_graph[key] = graph.Graph( - num_nodes=self._num_nodes_dict[key[1]], + self._multi_graph[key] = NodeGraph( + num_nodes=self._num_nodes, edges=value, - node_feat=node_feat, + node_types=node_types, + node_feat=self._node_feat, edge_feat=edge_feat) + @property + def num_nodes(self): + """Return the number of nodes. + """ + return self._num_nodes + def __getitem__(self, edge_type): """__getitem__ """ return self._multi_graph[edge_type] - def meta_path_random_walk(self, start_node, edge_types, meta_path, - max_depth): - """Meta path random walk sampling. - - Args: - start_nodes: int, node to begin random walk. - edge_types: list, the edge types to be sampled. - meta_path: 'user-item-user' - max_depth: the max length of every walk. 
- """ - edges_type_list = [] - node_type_list = meta_path.split('-') - for i in range(1, len(node_type_list)): - edges_type_list.append( - (node_type_list[i - 1], node_type_list[i], edge_types[i - 1])) - - no_neighbors_flag = False - walk = [start_node] - for i in range(max_depth): - for e_type in edges_type_list: - cur_node = [walk[-1]] - nxt_node = self._multi_graph[e_type].sample_successor( - cur_node, max_degree=1) # list of np.array - nxt_node = nxt_node[0] - if len(nxt_node) == 0: - no_neighbors_flag = True - break - else: - walk.append(nxt_node.tolist()[0]) - - if no_neighbors_flag: - break - - return walk - def node_feat_info(self): """Return the information of node feature for HeterGraphWrapper. @@ -155,17 +144,13 @@ class HeterGraph(object): function is used to help constructing HeterGraphWrapper Return: - A dict of list of tuple (name, shape, dtype) for all given node feature. + A list of tuple (name, shape, dtype) for all given node feature. """ - node_feat_info = {} - for node_type_name, feat_dict in self._node_feat.items(): - tmp_node_feat_info = [] - for feat_name, feat in feat_dict.items(): - full_name = feat_name - tmp_node_feat_info.append( - (full_name, _hide_num_nodes(feat.shape), feat.dtype)) - node_feat_info[node_type_name] = tmp_node_feat_info + node_feat_info = [] + for feat_name, feat in self._node_feat.items(): + node_feat_info.append( + (feat_name, _hide_num_nodes(feat.shape), feat.dtype)) return node_feat_info @@ -193,7 +178,7 @@ class HeterGraph(object): """Return the information of all edge types. Return: - A list of tuple ('srctype','dsttype', 'edges_type') for all edge types. + A list of all edge types. """ edge_types_info = [] diff --git a/pgl/contrib/heter_graph_wrapper.py b/pgl/contrib/heter_graph_wrapper.py index 27937c11c2f1610ff49d670bf24a76e1aa2cb54d..5282bfa51c7deb5776400d557dbca917fb37f04f 100644 --- a/pgl/contrib/heter_graph_wrapper.py +++ b/pgl/contrib/heter_graph_wrapper.py @@ -26,6 +26,7 @@ from pgl.utils.logger import log from pgl.graph_wrapper import GraphWrapper ALL = "__ALL__" +__all__ = ["HeterGraphWrapper"] def is_all(arg): @@ -34,89 +35,6 @@ def is_all(arg): return isinstance(arg, str) and arg == ALL -class BipartiteGraphWrapper(GraphWrapper): - """Implement a bipartite graph wrapper that creates a graph data holders. - """ - - def __init__(self, name, place, node_feat=[], edge_feat=[]): - super(BipartiteGraphWrapper, self).__init__(name, place, node_feat, - edge_feat) - - def send(self, - message_func, - src_nfeat_list=None, - dst_nfeat_list=None, - efeat_list=None): - """Send message from all src nodes to dst nodes. - - The UDF message function should has the following format. - - .. code-block:: python - - def message_func(src_feat, dst_feat, edge_feat): - ''' - Args: - src_feat: the node feat dict attached to the src nodes. - dst_feat: the node feat dict attached to the dst nodes. - edge_feat: the edge feat dict attached to the - corresponding (src, dst) edges. - - Return: - It should return a tensor or a dictionary of tensor. And each tensor - should have a shape of (num_edges, dims). - ''' - pass - - Args: - message_func: UDF function. - src_nfeat_list: a list of tuple (name, tensor) for src nodes - dst_nfeat_list: a list of tuple (name, tensor) for dst nodes - efeat_list: a list of names or tuple (name, tensor) - - Return: - A dictionary of tensor representing the message. Each of the values - in the dictionary has a shape (num_edges, dim) which should be collected - by :code:`recv` function. 
- """ - if efeat_list is None: - efeat_list = {} - if src_nfeat_list is None: - src_nfeat_list = {} - if dst_nfeat_list is None: - dst_nfeat_list = {} - - src, dst = self.edges - src_feat = {} - for feat in src_nfeat_list: - if isinstance(feat, str): - src_feat[feat] = self.node_feat[feat] - else: - name, tensor = feat - src_feat[name] = tensor - - dst_feat = {} - for feat in dst_nfeat_list: - if isinstance(feat, str): - dst_feat[feat] = self.node_feat[feat] - else: - name, tensor = feat - dst_feat[name] = tensor - - efeat = {} - for feat in efeat_list: - if isinstance(feat, str): - efeat[feat] = self.edge_feat[feat] - else: - name, tensor = feat - efeat[name] = tensor - - src_feat = op.read_rows(src_feat, src) - dst_feat = op.read_rows(dst_feat, dst) - msg = message_func(src_feat, dst_feat, efeat) - - return msg - - class HeterGraphWrapper(object): """Implement a heterogeneous graph wrapper that creates a graph data holders that attributes and features in the heterogeneous graph. @@ -141,43 +59,40 @@ class HeterGraphWrapper(object): (-1 or None) or we can easily use :code:`HeterGraph.edge_feat_info()` to get the edge_feat settings. - Examples: - .. code-block:: python - - import paddle.fluid as fluid - import numpy as np - num_nodes_every_type = {'type1':3,'type2':4, 'type3':2} - edges_every_type = { - ('type1','type2', 'edges_type1'): [(0,1), (1,2)], - ('type1', 'type3', 'edges_type2'): [(1,2), (3,1)], - } - node_feat_every_type = { - 'type1': {'features1': np.random.randn(3, 4), - 'features2': np.random.randn(3, 4)}, - 'type2': {'features3': np.random.randn(4, 4)}, - 'type3': {'features1': np.random.randn(2, 4), - 'features2': np.random.randn(2, 4)} - } - edges_feat_every_type = { - ('type1','type2','edges_type1'): {'h': np.random.randn(2, 4)}, - ('type1', 'type3', 'edges_type2'): {'h':np.random.randn(2, 4)}, - } - - g = heter_graph.HeterGraph( - num_nodes_every_type=num_nodes_every_type, - edges_every_type=edges_every_type, - node_feat_every_type=node_feat_every_type, - edge_feat_every_type=edges_feat_every_type) - - - place = fluid.CPUPlace() - - gw = pgl.heter_graph_wrapper.HeterGraphWrapper( - name='heter_graph', - place = place, - edge_types = g.edge_types_info(), - node_feat=g.node_feat_info(), - edge_feat=g.edge_feat_info()) + Examples: + .. 
code-block:: python + + import paddle.fluid as fluid + import numpy as np + from pgl.contrib import heter_graph + from pgl.contrib import heter_graph_wrapper + num_nodes = 4 + node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')] + edges = { + 'edges_type1': [(0,1), (3,2)], + 'edges_type2': [(1,2), (3,1)], + } + node_feat = {'feature': np.random.randn(4, 16)} + edges_feat = { + 'edges_type1': {'h': np.random.randn(2, 16)}, + 'edges_type2': {'h': np.random.randn(2, 16)}, + } + + g = heter_graph.HeterGraph( + num_nodes=num_nodes, + edges=edges, + node_types=node_types, + node_feat=node_feat, + edge_feat=edges_feat) + + place = fluid.CPUPlace() + + gw = heter_graph_wrapper.HeterGraphWrapper( + name='heter_graph', + place = place, + edge_types = g.edge_types_info(), + node_feat=g.node_feat_info(), + edge_feat=g.edge_feat_info()) """ def __init__(self, name, place, edge_types, node_feat={}, edge_feat={}): @@ -186,10 +101,9 @@ class HeterGraphWrapper(object): self._edge_types = edge_types self._multi_gw = {} for edge_type in self._edge_types: - type_name = self.__data_name_prefix + '/' + edge_type[ - 0] + '_' + edge_type[1] + type_name = self.__data_name_prefix + '/' + edge_type if node_feat: - n_feat = node_feat[edge_type[0]] + n_feat = node_feat else: n_feat = {} @@ -198,7 +112,7 @@ class HeterGraphWrapper(object): else: e_feat = {} - self._multi_gw[edge_type] = BipartiteGraphWrapper( + self._multi_gw[edge_type] = GraphWrapper( name=type_name, place=self._place, node_feat=n_feat, diff --git a/pgl/graph_wrapper.py b/pgl/graph_wrapper.py index d41d09c5fcf62bfc8661f5999a720392c295a186..797796bd3410d3288fc8134874edee2acdb14665 100644 --- a/pgl/graph_wrapper.py +++ b/pgl/graph_wrapper.py @@ -596,8 +596,7 @@ class GraphWrapper(BaseGraphWrapper): feed_dict[self.__data_name_prefix + '/edges_src'] = src feed_dict[self.__data_name_prefix + '/edges_dst'] = dst - feed_dict[self.__data_name_prefix + '/num_nodes'] = np.array( - graph.num_nodes) + feed_dict[self.__data_name_prefix + '/num_nodes'] = np.array(graph.num_nodes) feed_dict[self.__data_name_prefix + '/uniq_dst'] = uniq_dst feed_dict[self.__data_name_prefix + '/uniq_dst_count'] = uniq_dst_count feed_dict[self.__data_name_prefix + '/node_ids'] = graph.nodes diff --git a/pgl/layers/__init__.py b/pgl/layers/__init__.py index c30291fe9102f98017b4d6aca241038c4823694f..a88c9b69b33335e091d5713400395f7c003361b4 100644 --- a/pgl/layers/__init__.py +++ b/pgl/layers/__init__.py @@ -16,6 +16,9 @@ from pgl.layers import conv from pgl.layers.conv import * +from pgl.layers import set2set +from pgl.layers.set2set import * __all__ = [] __all__ += conv.__all__ +__all__ += set2set.__all__ diff --git a/pgl/layers/set2set.py b/pgl/layers/set2set.py index 18d65fae97009a770900130ccf14d248df06ebed..f574c9f5e031cd0e5d21126c1498a5430b6e887c 100644 --- a/pgl/layers/set2set.py +++ b/pgl/layers/set2set.py @@ -23,6 +23,8 @@ import paddle.fluid.layers as L import pgl +__all__ = ['Set2Set'] + class Set2Set(object): """Implementation of set2set pooling operator. 
diff --git a/pgl/sample.py b/pgl/sample.py
index 176c32b9c85e7819646de494966120583c2ddb98..b0b13bb3150276f6134fb8fdadd0fa9f62a48937 100644
--- a/pgl/sample.py
+++ b/pgl/sample.py
@@ -22,7 +22,10 @@ import pgl
 from pgl.utils.logger import log
 from pgl import graph_kernel
 
-__all__ = ['graphsage_sample', 'node2vec_sample', 'deepwalk_sample']
+__all__ = [
+    'graphsage_sample', 'node2vec_sample', 'deepwalk_sample',
+    'metapath_randomwalk'
+]
 
 
 def edge_hash(src, dst):
@@ -251,3 +254,46 @@ def node2vec_sample(graph, nodes, max_depth, p=1.0, q=1.0):
         prev_nodes, prev_succs = cur_nodes, cur_succs
         cur_nodes = nxt_nodes
     return walk
+
+
+def metapath_randomwalk(graph, start_node, metapath, walk_length):
+    """Implementation of metapath random walk in a heterogeneous graph.
+
+    Args:
+        graph: instance of a pgl heterogeneous graph
+        start_node: start node to generate the walk
+        metapath: meta path for sampling nodes,
+                  e.g.: "user-item-user"
+        walk_length: the walk length
+
+    Return:
+        A list of node ids forming the metapath walk.
+
+    """
+    # Reseed so that each worker process draws different walks.
+    np.random.seed()
+    walk = []
+    metapath = metapath.split('-')
+    assert metapath[0] == metapath[
+        -1], "The last meta path item should be the same as the first one"
+    mp_len = len(metapath) - 1
+
+    walk.append(start_node)
+    for i in range(1, walk_length):
+        cur_node = walk[-1]
+        succs = graph.successor(cur_node)
+        if succs.size > 0:
+            succs_node_types = graph._node_types[succs]
+        else:
+            # no successor of current node
+            break
+
+        succs_nodes = succs[np.where(succs_node_types == metapath[i % mp_len])[
+            0]]
+        if succs_nodes.size > 0:
+            walk.append(np.random.choice(succs_nodes))
+        else:
+            # no successor of such node type
+            break
+
+    return walk
diff --git a/pgl/utils/paddle_helper.py b/pgl/utils/paddle_helper.py
index f1e53aedcd25389f1994301a5213defabd62ba52..adbece57a6580f266eba5e6158207127218b79e0 100644
--- a/pgl/utils/paddle_helper.py
+++ b/pgl/utils/paddle_helper.py
@@ -223,7 +223,7 @@ def scatter_add(input, index, updates):
         Same type and shape as input.
 
     """
-    output = fluid.layers.scatter(input, index, updates, mode='add')
+    output = fluid.layers.scatter(input, index, updates, overwrite=False)
     return output
 
 
diff --git a/requirements.txt b/requirements.txt
index e0cabc03623e4204e31dfc0da9a0fb49f5b6c3d9..4a13b16b9df0af570752f5dac9857fa7e496d3db 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -3,4 +3,4 @@ cython >= 0.25.2
 
 #paddlepaddle
 
-redis-py-cluster == 1.3.6
+redis-py-cluster
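Taken together, the heterogeneous-graph pieces added in this patch can be exercised end to end. A minimal sketch on a toy graph (adapted from the HeterGraph docstring example above; the exact walk printed is random):

```python
import numpy as np
from pgl.contrib import heter_graph
from pgl.sample import metapath_randomwalk

# Toy user-item graph; edges are listed in both directions so every node
# has successors of the opposite type.
node_types = [(0, 'user'), (1, 'item'), (2, 'item'), (3, 'user')]
edges = {'edge': [(0, 1), (1, 0), (1, 3), (3, 1), (3, 2), (2, 3)]}

g = heter_graph.HeterGraph(
    num_nodes=4, edges=edges, node_types=node_types)

# Walk on the 'edge' subgraph, alternating user-item node types.
walk = metapath_randomwalk(
    g['edge'], start_node=0, metapath='user-item-user', walk_length=5)
print(walk)  # e.g. [0, 1, 3, 2, 3]
```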