提交 7ec70c12 编写于 作者: L liweibin

add docs

上级 c26ccbc3
pgl.contrib.heter\_graph module: Heterogenous Graph Storage
===============================
.. automodule:: pgl.contrib.heter_graph
:members:
:undoc-members:
:show-inheritance:
pgl.contrib.heter\_graph\_wrapper module: Heterogenous Graph data holders for Paddle GNN.
=========================
.. automodule:: pgl.contrib.heter_graph_wrapper
:members:
:undoc-members:
:show-inheritance:
...@@ -9,3 +9,5 @@ API Reference ...@@ -9,3 +9,5 @@ API Reference
pgl.data_loader pgl.data_loader
pgl.utils.paddle_helper pgl.utils.paddle_helper
pgl.utils.mp_reader pgl.utils.mp_reader
pgl.contrib.heter_graph
pgl.contrib.heter_graph_wrapper
.. mdinclude:: md/dgi_examples.md
View the Code
=============
examples/dgi/train.py
.. literalinclude:: ../../../examples/dgi/train.py
:language: python
:linenos:
.. mdinclude:: md/distribute_deepwalk_examples.md
View the Code
=============
examples/distribute_deepwalk/gpu_train.py
.. literalinclude:: ../../../examples/distribute_deepwalk/gpu_train.py
:language: python
:linenos:
.. mdinclude:: md/distribute_graphsage_examples.md
View the Code
=============
examples/distribute_graphsage/train.py
.. literalinclude:: ../../../examples/distribute_graphsage/train.py
:language: python
:linenos:
View the Code View the Code
============= =============
examples/gat/train.py examples/gatne/main.py
.. literalinclude:: ../../../examples/gat/train.py .. literalinclude:: ../../../examples/gatne/main.py
:language: python :language: python
:linenos: :linenos:
.. mdinclude:: md/gatne_examples.md
View the Code
=============
examples/dgi/train.py
.. literalinclude:: ../../../examples/dgi/train.py
:language: python
:linenos:
.. mdinclude:: md/ges_examples.md
View the Code
=============
examples/ges/gpu_train.py
.. literalinclude:: ../../../examples/ges/gpu_train.py
:language: python
:linenos:
.. mdinclude:: md/line_examples.md
View the Code
=============
examples/line/line.py
.. literalinclude:: ../../../examples/line/line.py
:language: python
:linenos:
# PGL Examples for DGI
[Deep Graph Infomax \(DGI\)](https://arxiv.org/abs/1809.10341) is a general approach for learning node representations within graph-structured data in an unsupervised manner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl
### Performance
We use DGI to pretrain embeddings for each nodes. Then we fix the embedding to train a node classifier.
| Dataset | Accuracy |
| --- | --- |
| Cora | ~81% |
| Pubmed | ~77.6% |
| Citeseer | ~71.3% |
### How to run
For examples, use gpu to train gcn on cora dataset.
```
python dgi.py --dataset cora --use_cuda
python train.py --dataset cora --use_cuda
```
#### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use gpu if assign use_cuda.
### View the Code
See the code [here](dgi_examples_code.html)
# PGL Examples for distributed deepwalk
[Deepwalk](https://arxiv.org/pdf/1403.6652.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce distributed deepwalk algorithms and reach the same level of indicators as the paper.
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3).
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0
## How to run
We adopt [PaddlePaddle Fleet](https://github.com/PaddlePaddle/Fleet) as our distributed training frameworks ```pgl_deepwalk.cfg``` is config file for deepwalk hyperparameter and ```local_config``` is a config file for parameter servers. By default, we have 2 pservers and 2 trainers. We can use ```cloud_run.sh``` to help you startup the parameter servers and model trainers.
For examples, train deepwalk in distributed mode on BlogCataLog dataset.
```sh
# train deepwalk in distributed mode.
sh cloud_run.sh
# multiclass task example
python3 multi_class.py --use_cuda --ckpt_path ./model_path/4029 --epoch 1000
```
## Hyperparameters
- dataset: The citation dataset "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epoch.
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
BlogCatalog|distributed deepwalk|multi-label classification|MacroF1|0.233|0.211
### View the Code
See the code [here](distribute_deepwalk_examples_code.html)
# Distribute GraphSAGE in PGL
[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training individual embeddings for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node’s local neighborhood. Based on PGL, we reproduce GraphSAGE algorithm and reach the same level of indicators as the paper in Reddit Dataset. Besides, this is an example of subgraph sampling and training in PGL.
For purpose of high scalability, we use redis as distribute graph storage solution and training graphsage against redis server.
### Datasets(Quickstart)
The reddit dataset should be downloaded from [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2Jthe). The details for Reddit Dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).
Alternatively, reddit dataset has been preprocessed and packed into docker image, which can be instantly pulled using following commands.
```sh
docker pull githubutilities/reddit_redis_demo:v0.1
```
### Dependencies
```txt
- paddlepaddle>=1.6
- pgl
- scipy
- redis==2.10.6
- redis-py-cluster==1.3.6
```
### How to run
#### 1. Start reddit data service
```sh
docker run \
--net=host \
-d --rm \
--name reddit_demo \
-it githubutilities/reddit_redis_demo:v0.1 \
/bin/bash -c "/bin/bash ./before_hook.sh && /bin/bash"
docker logs -f `docker ps -aqf "name=reddit_demo"`
```
#### 2. training GraphSAGE model
```sh
python train.py --use_cuda --epoch 10 --graphsage_type graphsage_mean --sample_workers 10
```
#### Hyperparameters
- epoch: Number of epochs default (10)
- use_cuda: Use gpu if assign use_cuda.
- graphsage_type: We support 4 aggregator types including "graphsage_mean", "graphsage_maxpool", "graphsage_meanpool" and "graphsage_lstm".
- sample_workers: The number of workers for multiprocessing subgraph sample.
- lr: Learning rate.
- batch_size: Batch size.
- samples_1: The max neighbors for the first hop neighbor sampling. (default: 25)
- samples_2: The max neighbors for the second hop neighbor sampling. (default: 10)
- hidden_size: The hidden size of the GraphSAGE models.
### View the Code
See the code [here](distribute_graphsage_examples_code.html)
# PGL Examples for GATNE
[GATNE](https://arxiv.org/pdf/1905.01669.pdf) is a algorithms framework for embedding large-scale Attributed Multiplex Heterogeneous Networks(AMHN). Given a heterogeneous graph, which consists of nodes and edges of multiple types, it can learn continuous feature representations for every node. Based on PGL, we reproduce GATNE algorithm.
## Datasets
YouTube dataset contains 2000 nodes, 1310617 edges and 5 edge types. And we use YouTube dataset for example.
You can dowload YouTube datasets from [here](https://github.com/THUDM/GATNE/tree/master/data)
After downloading the data, put them, let's say, in ./data/ . Note that the current directory is the root directory of GATNE model. Then in ./data/youtube/ directory, there are three files:
* train.txt
* valid.txt
* test.txt
Then you can run the below command to preprocess the data.
```sh
python data_process.py --input_file ./data/youtube/train.txt --output_file ./data/youtube/nodes.txt
```
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All the hyper parameters are saved in config.yaml file. So before training GATNE model, you can open the config.yaml to modify the hyper parameters as you like.
for example, you can change the \"use_cuda\" to \"True \" in order to use GPU for training or modify \"data_path\" to use different dataset.
Some important hyper parameters in config.yaml:
- use_cuda: use GPU to train model
- data_path: the directory of dataset
- lr: learning rate
- neg_num: number of negatie samples.
- num_walks: number of walks started from each node
- walk_length: walk length
## How to run
Then run the below command:
```sh
python main.py -c config.yaml
```
### Experiment results
| | PGL result | Reported result |
|:---:|------------|-----------------|
| AUC | 84.83 | 84.61 |
| PR | 82.77 | 81.93 |
| F1 | 76.98 | 76.83 |
### View the Code
See the code [here](gatne_examples_code.html)
# PGL Examples for GES
[Graph Embedding with Side Information](https://arxiv.org/pdf/1803.02349.pdf) is an algorithmic framework for representational learning on graphs. Given any graph, it can learn continuous feature representations for the nodes, which can then be used for various downstream machine learning tasks. Based on PGL, we reproduce ges algorithms.
## Datasets
The datasets contain two networks: [BlogCatalog](http://socialcomputing.asu.edu/datasets/BlogCatalog3).
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## How to run
For examples, train ges on cora dataset.
```sh
# train deepwalk in distributed mode.
sh gpu_run.sh
```
## Hyperparameters
- dataset: The citation dataset "BlogCatalog".
- hidden_size: Hidden size of the embedding.
- lr: Learning rate.
- neg_num: Number of negative samples.
- epoch: Number of training epoch.
### View the Code
See the code [here](ges_examples_code.html)
# PGL Examples for LINE
[LINE](http://www.www2015.it/documents/proceedings/proceedings/p1067.pdf) is an algorithmic framework for embedding very large-scale information networks. It is suitable to a variety of networks including directed, undirected, binary or weighted edges. Based on PGL, we reproduce LINE algorithms and reach the same level of indicators as the paper.
## Datasets
[Flickr network](http://socialnetworks.mpi-sws.org/data-imc2007.html) is a social network, which contains 1715256 nodes and 22613981 edges.
You can dowload data from [here](http://socialnetworks.mpi-sws.org/data-imc2007.html).
Flickr network contains four files:
* flickr-groupmemberships.txt.gz
* flickr-groups.txt.gz
* flickr-links.txt.gz
* flickr-users.txt.gz
After downloading the data,uncompress them, let's say, in **./data/flickr/** . Note that the current directory is the root directory of LINE model.
Then you can run the below command to preprocess the data.
```sh
python data_process.py
```
Then it will produce three files in **./data/flickr/** directory:
* nodes.txt
* edges.txt
* nodes_label.txt
## Dependencies
- paddlepaddle>=1.6
- pgl>1.0.0
## How to run
For examples, use gpu to train LINE on Flickr dataset.
```sh
# multiclass task example
python line.py --use_cuda --order first_order --data_path ./data/flickr/ --save_dir ./checkpoints/model/
python multi_class.py --ckpt_path ./checkpoints/model/model_epoch_20 --percent 0.5
```
## Hyperparameters
- -use_cuda: Use gpu if assign use_cuda.
- -order: LINE with First_order Proximity or Second_order Proximity
- -percent: The percentage of data as training data
### Experiment results
Dataset|model|Task|Metric|PGL Result|Reported Result
--|--|--|--|--|--
Flickr|LINE with first_order|multi-label classification|MacroF1|0.626|0.627
Flickr|LINE with first_order|multi-label classification|MicroF1|0.637|0.639
Flickr|LINE with second_order|multi-label classification|MacroF1|0.615|0.621
Flickr|LINE with second_order|multi-label classification|MicroF1|0.630|0.635
### View the Code
See the code [here](line_examples_code.html)
# PGL examples for metapath2vec
[metapath2vec](https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf) is a algorithm framework for representation learning in heterogeneous networks which contains multiple types of nodes and links. Given a heterogeneous graph, metapath2vec algorithm first generates meta-path-based random walks and then use skipgram model to train a language model. Based on PGL, we reproduce metapath2vec algorithm.
## Datasets
You can dowload datasets from [here](https://ericdongyx.github.io/metapath2vec/m2v.html)
We use the "aminer" data for example. After downloading the aminer data, put them, let's say, in ./data/net_aminer/ . We also need to put "label/" directory in ./data/.
## Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
## Hyperparameters
All the hyper parameters are saved in config.yaml file. So before training, you can open the config.yaml to modify the hyper parameters as you like.
for example, you can change the \"use_cuda\" to \"True \" in order to use GPU for training or modify \"data_path\" to specify the data you want.
Some important hyper parameters in config.yaml:
- **use_cuda**: use GPU to train model
- **data_path**: the directory of dataset that you want to load
- **lr**: learning rate
- **neg_num**: number of negatie samples.
- **num_walks**: number of walks started from each node
- **walk_length**: walk length
- **metapath**: meta path scheme
## Metapath randomwalk sampling
Before training, we should generate some metapath random walks to train skipgram model. we can run the below command to produce metapath randomwalk data.
```sh
python sample.py -c config.yaml
```
## Training and Testing
After finishing metapath randomwalk sampling, you can run the below command to train and test the model.
```sh
python main.py -c config.yaml
python multi_class.py --dataset ./data/out_aminer_CPAPC/author_label.txt --word2id ./checkpoints/train.metapath2vec/word2id.pkl --ckpt_path ./checkpoints/train.metapath2vec/model_epoch5/
```
## Experiment results
| train_percent | Metric | PGL Result | Reported Result |
|---------------|----------|------------|-----------------|
| 50% | macro-F1 | 0.9249 | 0.9314 |
| 50% | micro-F1 | 0.9283 | 0.9365 |
### View the Code
See the code [here](metapath2vec_examples_code.html)
# PGL Examples for SGC
[Simplifying Graph Convolutional Networks \(SGC\)](https://arxiv.org/pdf/1902.07153.pdf) is a powerful neural network designed for machine learning on graphs. Based on PGL, we reproduce SGC algorithms and reach the same level of indicators as the paper in citation network benchmarks.
### Datasets
The datasets contain three citation networks: CORA, PUBMED, CITESEER. The details for these three datasets can be found in the [paper](https://arxiv.org/abs/1609.02907).
### Dependencies
- paddlepaddle>=1.6
- pgl>=1.0.0
### Performance
We train our models for 200 epochs and report the accuracy on the test dataset.
| Dataset | Accuracy | Speed with paddle 1.5 <br> (epoch time)|
| --- | --- | ---|
| Cora | 0.818 (paper: 0.810) | 0.0015s |
| Pubmed | 0.788 (paper: 0.789) | 0.0015s |
| Citeseer | 0.719 (paper: 0.719) | 0.0015s |
### How to run
For examples, use gpu to train SGC on cora dataset.
```
python sgc.py --dataset cora --use_cuda
```
### Hyperparameters
- dataset: The citation dataset "cora", "citeseer", "pubmed".
- use_cuda: Use gpu if assign use_cuda.
### View the Code
See the code [here](sgc_examples_code.html)
# PGL Examples For Struc2Vec
[Struc2vec](https://arxiv.org/abs/1704.03165) is is a concept of symmetry in which network nodes are identified according to the network structure and their relationship to other nodes. A novel and flexible framework for learning latent representations is proposed in the paper of struc2vec. We reproduce Struc2vec algorithm in the PGL.
### DataSet
The paper of use air-traffic network to valid algorithm of Struc2vec.
The each edge in the dataset indicate that having one flight between the airports. Using the the connection between the airports to predict the level of activity. The following dataset will be used to valid the algorithm accuracy.Data collected from the Bureau of Transportation Statistics2 from January to October, 2016. The network has 1,190 nodes, 13,599 edges (diameter is 8). [Link](https://www.transtats.bts.gov/)
- usa-airports.edgelist
- labels-usa-airports.txt
### Dependencies
If use want to use the struc2vec model in pgl, please install the gensim, pathos, fastdtw additional.
- paddlepaddle>=1.6
- pgl
- gensim
- pathos
- fastdtw
### How to use
For examples, we want to train and valid the Struc2vec model on American airpot dataset
> python struc2vec.py --edge_file data/usa-airports.edgelist --label_file data/labels-usa-airports.txt --train True --valid True --opt2 True
### Hyperparameters
| Args| Meaning|
| ------------- | ------------- |
| edge_file | input file name for edges|
| label_file | input file name for node label|
| emb_file | input file name for node label|
| walk_depth| The step3 for random walk|
| opt1| The flag to open optimization 1 to reduce time cost|
| opt2| The flag to open optimization 2 to reduce time cost|
| w2v_emb_size| The dims of output the word2vec embedding|
| w2v_window_size| The context length of word2vec|
| w2v_epoch| The num of epoch to train the model.|
| train| The flag to run the struc2vec algorithm to get the w2v embedding|
| valid| The flag to use the w2v embedding to valid the classification result|
| num_class| The num of class in classification model to be trained|
### Experiment results
| Dataset | Model | Metric | PGL Result | Paper repo Result |
| ------------- | ------------- |------------- |------------- |------------- |
| American airport dataset | Struc2vec without time cost optimization| ACC |0.6483|0.6340|
| American airport dataset | Struc2vec with optimization 1| ACC |0.6466|0.6242|
| American airport dataset | Struc2vec with optimization 2| ACC |0.6252|0.6241|
| American airport dataset | Struc2vec with optimization1&2| ACC |0.6226|0.6083|
### View the Code
See the code [here](strucvec_examples_code.html)
.. mdinclude:: md/metapath2vec_examples.md
View the Code
=============
examples/metapath2vec/main.py
.. literalinclude:: ../../../examples/metapath2vec/main.py
:language: python
:linenos:
.. mdinclude:: md/sgc_examples.md
View the Code
=============
examples/sgc/sgc.py
.. literalinclude:: ../../../examples/sgc/sgc.py
:language: python
:linenos:
.. mdinclude:: md/strucvec_examples.md
View the Code
=============
examples/strucvec/struc2vec.py
.. literalinclude:: ../../../examples/strucvec/struc2vec.py
:language: python
:linenos:
...@@ -29,7 +29,15 @@ Quick Start ...@@ -29,7 +29,15 @@ Quick Start
examples/static_graph_wrapper.rst examples/static_graph_wrapper.rst
examples/node2vec_examples.rst examples/node2vec_examples.rst
examples/graphsage_examples.rst examples/graphsage_examples.rst
examples/dgi_examples.rst
examples/distribute_deepwalk_examples.rst
examples/distribute_graphsage_examples.rst
examples/ges_examples.rst
examples/line_examples.rst
examples/sgc_examples.rst
examples/strucvec_examples.rst
examples/gatne_examples.rst
examples/metapath2vec_examples.rst
.. toctree:: .. toctree::
:maxdepth: 2 :maxdepth: 2
......
Markdown is supported
0% .
You are about to add 0 people to the discussion. Proceed with caution.
先完成此消息的编辑!
想要评论请 注册