# Distributed GraphSAGE in PGL

[GraphSAGE](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf) is a general inductive framework that leverages node feature
information (e.g., text attributes) to efficiently generate node embeddings for previously unseen data. Instead of training an individual embedding for each node, GraphSAGE learns a function that generates embeddings by sampling and aggregating features from a node's local neighborhood. Using PGL, we reproduce the GraphSAGE algorithm and reach the same level of performance on the Reddit dataset as reported in the paper. This example also demonstrates subgraph sampling and training in PGL.
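
The sample-and-aggregate idea can be illustrated with a minimal, dependency-free sketch. It is not the PGL implementation: the function names, the toy graph, and the choice of a mean aggregator are assumptions made for illustration, and the learned weight matrix and nonlinearity that normally follow the concatenation are omitted.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Sample up to k neighbors of `node` without replacement."""
    neighbors = adj.get(node, [])
    if len(neighbors) <= k:
        return list(neighbors)
    return rng.sample(neighbors, k)


def mean_aggregate(features, node, neighbors):
    """Concatenate a node's own features with the mean of its neighbors' features."""
    dim = len(features[node])
    if neighbors:
        mean = [sum(features[n][i] for n in neighbors) / len(neighbors)
                for i in range(dim)]
    else:
        # Isolated node: fall back to a zero vector for the neighborhood part.
        mean = [0.0] * dim
    return features[node] + mean


# Toy graph (hypothetical data): node 0 is connected to nodes 1, 2 and 3.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
features = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 2.0], 3: [4.0, 2.0]}

rng = random.Random(0)
sampled = sample_neighbors(adj, 0, 2, rng)
hidden = mean_aggregate(features, 0, sampled)
print(sampled, hidden)
```

Because the output depends only on a node's features and its sampled neighborhood, the same function can be applied to nodes that were not seen during training.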

For high scalability, we use Redis as the distributed graph storage backend and train GraphSAGE against the Redis server.
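
As a rough sketch of what sampling neighbors from a Redis-backed graph might look like, the snippet below stores each node's neighbors in a Redis set and asks the server for a random subset. The key layout `nbr:<node>` and the helper names are hypothetical and are not taken from this example's scripts. `InMemoryClient` is a stand-in that mimics only the two client calls used (`sadd` and `srandmember`), so the sketch runs without a server.

```python
import random


def neighbor_key(node):
    # Hypothetical key layout; the schema used by this example may differ.
    return f"nbr:{node}"


def store_edges(client, edges):
    """Write each directed edge (src, dst) into a set keyed by src."""
    for src, dst in edges:
        client.sadd(neighbor_key(src), dst)


def sample_neighbors_from_store(client, node, k):
    """Ask the store for up to k distinct random neighbors of `node`."""
    return sorted(int(n) for n in client.srandmember(neighbor_key(node), k))


class InMemoryClient:
    """Stand-in exposing the two calls used above (not a real Redis client)."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        self._sets.setdefault(name, set()).update(str(v) for v in values)

    def srandmember(self, name, number):
        members = sorted(self._sets.get(name, set()))
        return random.sample(members, min(number, len(members)))


# With a running server and redis-py installed, this could instead be
# redis.Redis(host="localhost", port=6379).
client = InMemoryClient()
store_edges(client, [(0, 1), (0, 2), (0, 3), (1, 0)])
print(sample_neighbors_from_store(client, 0, 2))
```

Keeping adjacency on the server means a training worker only needs to fetch the neighborhoods it samples, rather than hold the whole graph in memory.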

### Datasets (Quickstart)
The Reddit dataset consists of two files, [reddit_adj.npz](https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt) and [reddit.npz](https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J), which must be downloaded before running. Details of the Reddit dataset can be found [here](https://cs.stanford.edu/people/jure/pubs/graphsage-nips17.pdf).

- reddit.npz: https://drive.google.com/open?id=19SphVl_Oe8SJ1r87Hr5a6znx3nJu1F2J
- reddit_adj.npz: https://drive.google.com/open?id=174vb0Ws7Vxk_QTUtxqTgDHSQ4El4qDHt

Download `reddit.npz` and `reddit_adj.npz` into the `data` directory for further preprocessing.
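
Before starting preprocessing, it may help to confirm that both files are where the scripts expect them. The following is a small optional check, not part of this example's code; the helper name `missing_files` is hypothetical, and it assumes the `data` directory is relative to the current working directory.

```python
from pathlib import Path

REQUIRED_FILES = ("reddit.npz", "reddit_adj.npz")


def missing_files(data_dir="data", names=REQUIRED_FILES):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


missing = missing_files()
if missing:
    print("Missing from data/:", ", ".join(missing))
else:
    print("Both dataset files are in place.")
```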

### Dependencies

```sh
pip install -r requirements.txt
```

### How to run

#### 1. Preprocess the data and start the Reddit data service

```sh
pushd ./redis_setup
    /bin/bash ./before_hook.sh
popd
```

#### 2. Train the GraphSAGE model

```sh
sh ./cloud_run.sh
```