[Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
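As a rough illustration of the objective, here is a NumPy sketch of the DGI loss as described in the paper (not PGL's implementation): a bilinear discriminator scores each patch representation against the graph summary, and the loss pulls real patches toward the summary while pushing corrupted ones away.
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dgi_loss(H, H_corrupt, W):
    """H, H_corrupt: (num_nodes, dim) patch representations from the encoder,
    run on the real graph and on a corrupted (e.g. feature-shuffled) graph.
    W: (dim, dim) weights of the bilinear discriminator."""
    s = sigmoid(H.mean(axis=0))       # readout: high-level graph summary
    pos = sigmoid(H @ W @ s)          # discriminator scores for real patches
    neg = sigmoid(H_corrupt @ W @ s)  # scores for corrupted patches
    eps = 1e-8                        # numerical safety for the logs
    return -(np.log(pos + eps).mean() + np.log(1.0 - neg + eps).mean())
```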
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain an embedding for each node, then freeze the embeddings and train a node classifier on top of them.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For example, to train DGI on the Cora dataset with a GPU:
```sh
# Pretrain node embeddings with DGI
python dgi.py --dataset cora --use_cuda
# Train a node classifier on the frozen embeddings
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: the citation network to use: "cora", "citeseer", or "pubmed".
[DeepWalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the DeepWalk algorithm in distributed mode and match the results reported in the paper.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training framework. ```pgl_deepwalk.cfg``` is the config file for DeepWalk hyperparameters, and ```local_config``` is the config file for the parameter servers. By default, we have 2 pservers and 2 trainers. You can use ```cloud_run.sh``` to start up the parameter servers and model trainers.
For example, train DeepWalk in distributed mode on the BlogCatalog dataset.
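A minimal sketch of the launch, assuming ```cloud_run.sh``` reads ```local_config``` and ```pgl_deepwalk.cfg``` as described above:
```sh
# Starts the 2 pservers and 2 trainers configured in local_config
sh cloud_run.sh
```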
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Based on PGL, we reproduce the GraphSAGE algorithm and match the results reported in the paper on the Reddit dataset. This example also demonstrates subgraph sampling and training in PGL.
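As a rough illustration of the sample-and-aggregate step, here is a NumPy sketch of a mean-aggregator layer in the spirit of the paper (not PGL's implementation):
```python
import numpy as np

def sage_mean_layer(X, sampled_neighbors, W_self, W_neigh):
    """X: (N, d_in) node features; sampled_neighbors[v]: array of sampled
    neighbor ids for node v; W_self, W_neigh: (d_in, d_out) weight matrices."""
    rows = []
    for v in range(X.shape[0]):
        nbrs = sampled_neighbors[v]
        # Mean of the sampled neighborhood (zeros if nothing was sampled)
        h_neigh = X[nbrs].mean(axis=0) if len(nbrs) else np.zeros(X.shape[1])
        rows.append(np.maximum(X[v] @ W_self + h_neigh @ W_neigh, 0.0))  # ReLU
    H = np.vstack(rows)
    # L2-normalize the embeddings, as in the paper
    return H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-8)
```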
For high scalability, we use Redis as the distributed graph storage solution and train GraphSAGE against the Redis server.
### Datasets (Quickstart)
The Reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J). The details of the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, the Reddit dataset has been preprocessed and packed into a Docker image, which can be pulled with the following commands.
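A hedged sketch of the pull-and-run flow; the image name below is a placeholder for illustration, not the actual published image:
```sh
# Placeholder image name -- substitute the actual preprocessed-Reddit image
docker pull <registry>/pgl-reddit:latest
docker run -it <registry>/pgl-reddit:latest /bin/bash
```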
[GATNE](https://arxiv.org/pdf/1905.01669.pdf) is an algorithmic framework for embedding large-scale Attributed Multiplex Heterogeneous Networks (AMHN). Given a heterogeneous graph, which consists of nodes and edges of multiple types, it can learn continuous feature representations for every node. Based on PGL, we reproduce the GATNE algorithm.
## Datasets
The YouTube dataset contains 2,000 nodes, 1,310,617 edges, and 5 edge types; we use it as the example dataset.
You can download the YouTube dataset from [here](https://github.com/THUDM/GATNE/tree/master/data).
After downloading the data, put it in, say, ./data/ (relative to the root directory of the GATNE model). The ./data/youtube/ directory should then contain three files:
* train.txt
* valid.txt
* test.txt
Then you can run the below command to preprocess the data.
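A hedged sketch of the preprocessing call; the script name is an assumption (mirroring the other examples in this repo), so check the repository for the actual entry point:
```sh
# Assumed script name -- check the repo for the actual preprocessing entry point
python data_process.py
```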
All hyperparameters are stored in the config.yaml file, so before training the GATNE model you can edit config.yaml to adjust them as you like.
For example, you can set "use_cuda" to "True" to train on GPU, or change "data_path" to use a different dataset.
Some important hyperparameters in config.yaml:
- use_cuda: whether to train the model on GPU
- data_path: the directory of the dataset
- lr: learning rate
- neg_num: number of negative samples
- num_walks: number of walks started from each node
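An illustrative ```config.yaml``` fragment covering the keys listed above; the values are placeholders, not the shipped defaults:
```yaml
use_cuda: True             # train on GPU
data_path: ./data/youtube/ # dataset directory
lr: 0.001                  # placeholder learning rate
neg_num: 5                 # negative samples per positive pair
num_walks: 20              # walks started from each node
```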
[Graph Embedding with Side Information (GES)](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce the GES algorithm.
## Datasets
The dataset is the [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3) social network.
[LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable for a variety of networks, including directed or undirected graphs and binary or weighted edges. Based on PGL, we reproduce the LINE algorithm and match the results reported in the paper.
## Datasets
The [Flickr network](http://socialnetworks.mpi-sws.org/data-imc2007.html) is a social network containing 1,715,256 nodes and 22,613,981 edges.
You can download the data from [here](http://socialnetworks.mpi-sws.org/data-imc2007.html).
Flickr network contains four files:
* flickr-groupmemberships.txt.gz
* flickr-groups.txt.gz
* flickr-links.txt.gz
* flickr-users.txt.gz
After downloading the data, uncompress it into, say, **./data/flickr/** (relative to the root directory of the LINE model).
Then you can run the below command to preprocess the data.
```sh
python data_process.py
```
It will produce three files in the **./data/flickr/** directory:
* nodes.txt
* edges.txt
* nodes_label.txt
## Dependencies
- paddlepaddle>=1.6
- pgl>1.0.0
## How to run
For example, to train LINE on the Flickr dataset with a GPU:
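A hedged sketch of the invocation; the entry-point name is an assumption, so check the repository for the actual script:
```sh
# Assumed entry point -- check the repo for the actual script name
python line.py --use_cuda
```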
[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is an algorithmic framework for representation learning in heterogeneous networks that contain multiple types of nodes and links. Given a heterogeneous graph, metapath2vec first generates meta-path-based random walks and then uses a skip-gram model to learn node representations. Based on PGL, we reproduce the metapath2vec algorithm.
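As a rough illustration of the sampling step, here is a Python sketch of a meta-path-guided random walk (illustrative only, not PGL's API):
```python
import random

def metapath_walk(start, walk_len, metapath, neighbors_by_type):
    """metapath: cyclic list of node types, e.g. ['conf', 'paper', 'author', 'paper'];
    neighbors_by_type[node][t]: neighbors of `node` that have type t."""
    walk = [start]
    for i in range(walk_len - 1):
        next_type = metapath[(i + 1) % len(metapath)]      # type required next
        candidates = neighbors_by_type[walk[-1]].get(next_type, [])
        if not candidates:                                 # dead end: stop early
            break
        walk.append(random.choice(candidates))
    return walk
```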
## Datasets
You can download the datasets from [here](https://ericdongyx.github.io/metapath2vec/m2v.html).
We use the "aminer" data as the example. After downloading the aminer data, put it in, say, ./data/net_aminer/ , and also put the "label/" directory in ./data/.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All hyperparameters are stored in the config.yaml file, so before training you can edit config.yaml to adjust them as you like.
For example, you can set "use_cuda" to "True" to train on GPU, or change "data_path" to point to the data you want.
Some important hyperparameters in config.yaml:
- **use_cuda**: use GPU to train the model
- **data_path**: the directory of the dataset you want to load
- **lr**: learning rate
- **neg_num**: number of negative samples
- **num_walks**: number of walks started from each node
- **walk_length**: length of each walk
- **metapath**: the meta-path scheme
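An illustrative ```config.yaml``` fragment covering the keys listed above; the values (including the meta-path string) are placeholders, not the shipped defaults:
```yaml
use_cuda: True                          # train on GPU
data_path: ./data/net_aminer/           # dataset directory
lr: 0.005                               # placeholder learning rate
neg_num: 5                              # negative samples per positive pair
num_walks: 10                           # walks started from each node
walk_length: 100                        # length of each walk
metapath: conf-paper-author-paper-conf  # placeholder scheme; format may differ
```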
## Metapath randomwalk sampling
Before training, we need to generate meta-path random walks to train the skip-gram model. Run the command below to produce the random-walk data.
```sh
python sample.py -c config.yaml
```
## Training and Testing
After the meta-path random-walk sampling finishes, you can run the command below to train and test the model.
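A hedged sketch mirroring ```sample.py```'s command-line interface; the entry-point names are assumptions, so check the repository for the actual scripts:
```sh
# Assumed entry points, mirroring sample.py's -c interface
python main.py -c config.yaml         # train the skip-gram model on the walks
python multi_class.py -c config.yaml  # evaluate via node classification
```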
[Simplifying Graph Convolutional Networks \(SGC\)](https://arxiv.org/pdf/1902.07153.pdf) is a simplified graph convolutional model for machine learning on graphs. Based on PGL, we reproduce the SGC algorithm and match the results reported in the paper on citation network benchmarks.
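As a rough illustration of SGC's core simplification, here is a NumPy sketch based on the paper (not PGL's implementation): propagate features K steps through the normalized adjacency, then fit an ordinary linear classifier on the result.
```python
import numpy as np

def sgc_features(A, X, K=2):
    """A: (N, N) adjacency matrix; X: (N, d) node features."""
    A_hat = A + np.eye(A.shape[0])          # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))  # D^{-1/2}
    S = D_inv_sqrt @ A_hat @ D_inv_sqrt     # symmetrically normalized adjacency
    for _ in range(K):
        X = S @ X                           # K linear propagation steps
    return X                                # feed to e.g. logistic regression
```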
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
[Struc2vec](https://arxiv.org/abs/1704.03165) builds on structural identity, a concept of symmetry in which network nodes are identified according to the network structure and their relationship to other nodes. The struc2vec paper proposes a novel and flexible framework for learning such latent representations. We reproduce the struc2vec algorithm in PGL.
### Dataset
The struc2vec paper uses air-traffic networks to validate the algorithm.
Each edge in the dataset indicates that there is at least one flight between two airports, and the task is to use the connections between airports to predict each airport's level of activity. The following dataset is used to validate the algorithm's accuracy. The data were collected from the Bureau of Transportation Statistics from January to October 2016; the network has 1,190 nodes and 13,599 edges (diameter is 8). [Link](https://www.transtats.bts.gov/)
- usa-airports.edgelist
- labels-usa-airports.txt
### Dependencies
If you want to use the struc2vec model in PGL, please additionally install gensim, pathos, and fastdtw (see the install command after this list).
- paddlepaddle>=1.6
- pgl
- gensim
- pathos
- fastdtw
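One way to install the extra dependencies (assuming pip is available):
```sh
pip install gensim pathos fastdtw
```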
### How to use
For example, to train and validate the struc2vec model on the American air-traffic dataset:
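A hedged sketch of the invocation; the script name and flags are assumptions based on the dataset files listed above, so check the repository for the actual interface:
```sh
# Assumed entry point and flags -- check the repo for the actual interface
python struc2vec.py --edge_file data/usa-airports.edgelist \
                    --label_file data/labels-usa-airports.txt
```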