## Step 1: using PGL to create a graph 
Suppose we have a graph with 10 nodes and 14 edges as shown in the following figure:
![A simple graph](images/quick_start_graph.png)

Our goal is to train a graph neural network that classifies the nodes as yellow or green. We can create this graph as follows:
```python
import pgl
from pgl import graph  # import pgl module
import numpy as np

def build_graph():
    # define the number of nodes; we can use number to represent every node
    num_node = 10
    # add edges, we represent all edges as a list of tuple (src, dst)
    edge_list = [(2, 0), (2, 1), (3, 1), (4, 0), (5, 0),
                 (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
                 (7, 2), (7, 3), (8, 0), (9, 7)]

    # Each node is represented by a d-dimensional feature vector; for simplicity, the feature vectors are randomly generated.
    d = 16
    feature = np.random.randn(num_node, d).astype("float32")
    # each edge has its own weight
    edge_feature = np.random.randn(len(edge_list), 1).astype("float32")
    
    # create a graph
    g = graph.Graph(num_nodes=num_node,
                    edges=edge_list,
                    node_feat={'feature': feature},
                    edge_feat={'edge_feature': edge_feature})

    return g

# create a graph object for saving graph data
g = build_graph()
```
After creating a graph in PGL, we can print out some information in the graph.

```python
print('There are %d nodes in the graph.'%g.num_nodes)
print('There are %d edges in the graph.'%g.num_edges)

# Out:
# There are 10 nodes in the graph.
# There are 14 edges in the graph. 
```
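
As a quick sanity check that needs only NumPy, the same edge list can be inspected directly. The sketch below is independent of PGL; the names `edges`, `in_degree` and `out_degree` are illustrative and not part of the PGL API. It counts the incoming and outgoing edges of every node.

```python
import numpy as np

num_node = 10
edge_list = [(2, 0), (2, 1), (3, 1), (4, 0), (5, 0),
             (6, 0), (6, 4), (6, 5), (7, 0), (7, 1),
             (7, 2), (7, 3), (8, 0), (9, 7)]

# shape (14, 2): the columns are (src, dst)
edges = np.array(edge_list, dtype="int64")
src, dst = edges[:, 0], edges[:, 1]

# count how often each node id appears as a source / as a destination
out_degree = np.bincount(src, minlength=num_node)
in_degree = np.bincount(dst, minlength=num_node)

print("number of edges:", len(edges))
print("in-degree :", in_degree.tolist())
print("out-degree:", out_degree.tolist())
```

Nodes with an in-degree of zero receive no messages during message passing, which is worth keeping in mind when reading the GCN layer in Step 2.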

PGL is currently built on the static computational graph mode of paddle (support for the dynamic mode is planned). In this mode, the model is first defined on placeholder data. `GraphWrapper` provides such a placeholder: a virtual graph structure on which users build the deep learning model, and into which the real graph data is fed when the model runs.
```python
import paddle.fluid as fluid

use_cuda = False  
place = fluid.CUDAPlace(0) if use_cuda else fluid.CPUPlace()

# use GraphWrapper as a container for graph data to construct a graph neural network
gw = pgl.graph_wrapper.GraphWrapper(name='graph',
                        node_feat=g.node_feat_info(),
                        edge_feat=g.edge_feat_info())  # needed because Step 2 reads gw.edge_feat
```

## Step 2: create a simple Graph Convolutional Network (GCN)

In this tutorial, we use a simple Graph Convolutional Network (GCN), as proposed by [Kipf and Welling](https://arxiv.org/abs/1609.02907), to perform node classification. We use the simplest GCN structure here; for more details on GCN, refer to the original paper.

* In layer $l$, each node $u_i^l$ has a feature vector $h_i^l$;
* In every layer, GCN computes the feature vector $h_i^{l+1}$ of each node $u_i^{l+1}$ in the next layer as a weighted sum of the feature vectors of its neighboring nodes, followed by a non-linear transformation.
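
The two points above can be sketched in matrix form with plain NumPy. This is a minimal illustration under stated assumptions, not the PGL implementation: `dense_gcn_layer` and the small 4-node graph are made up for the example, and the sketch omits the self-loops and degree normalization used in the original paper.

```python
import numpy as np

def dense_gcn_layer(adj, h, weight, activation=None):
    """One layer in matrix form: aggregate neighbors, then transform.

    adj[i, j] holds the weight of the edge j -> i (0 if there is no edge).
    """
    aggregated = adj @ h            # weighted sum of neighbor features
    out = aggregated @ weight       # learnable linear transformation
    if activation == "relu":
        out = np.maximum(out, 0.0)  # non-linear transformation
    return out

rng = np.random.default_rng(0)
num_node, d, hidden = 4, 3, 2

adj = np.zeros((num_node, num_node), dtype="float32")
adj[0, 1] = 1.0   # edge 1 -> 0
adj[0, 2] = 0.5   # edge 2 -> 0
adj[3, 0] = 2.0   # edge 0 -> 3

h = rng.standard_normal((num_node, d)).astype("float32")
w = rng.standard_normal((d, hidden)).astype("float32")

h_next = dense_gcn_layer(adj, h, w, activation="relu")
print(h_next.shape)
```

Each row of `h_next` is the new feature vector of one node; a node without incoming edges ends up with an all-zero row.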

In PGL, we can easily implement a GCN layer as follows:
```python
# define GCN layer function
def gcn_layer(gw, nfeat, efeat, hidden_size, name, activation):
    # gw is a GraphWrapper; nfeat holds the node feature vectors and efeat the edge features
    
    # define message function
    def send_func(src_feat, dst_feat, edge_feat): 
        # In this tutorial, the message is the feature vector of the source node scaled by the edge feature
        return src_feat['h'] * edge_feat['e']

    # define reduce function
    def recv_func(feat):
        # sum the messages arriving from all source nodes
        return fluid.layers.sequence_pool(feat, pool_type='sum')

    # trigger message passing along every edge
    msg = gw.send(send_func, nfeat_list=[('h', nfeat)], efeat_list=[('e', efeat)])
    # the recv function receives the messages and triggers the reduce function to aggregate them
    output = gw.recv(msg, recv_func)
    output = fluid.layers.fc(output,
                    size=hidden_size,
                    bias_attr=False,
                    act=activation,
                    name=name)
    return output
```
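
To make the send/receive pattern concrete without Paddle, here is a minimal NumPy sketch of the same two steps on a 3-node toy graph. The helpers `send` and `recv_sum` are hypothetical stand-ins written for this example; they mimic the roles of `send_func` and `recv_func` above but are not PGL functions.

```python
import numpy as np

def send(src_feat, edge_weight):
    # one message per edge: the source feature scaled by the edge weight
    return src_feat * edge_weight

def recv_sum(messages, dst, num_node):
    # sum all messages that arrive at the same destination node
    out = np.zeros((num_node, messages.shape[1]), dtype=messages.dtype)
    np.add.at(out, dst, messages)  # unbuffered add, handles repeated dst ids
    return out

edges = np.array([(1, 0), (2, 0), (0, 2)])  # (src, dst) pairs
src, dst = edges[:, 0], edges[:, 1]

h = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], dtype="float32")
e = np.array([[1.0], [0.5], [2.0]], dtype="float32")  # one weight per edge

msg = send(h[src], e)                # shape (num_edges, feature_dim)
agg = recv_sum(msg, dst, num_node=3) # shape (num_nodes, feature_dim)
print(agg)
```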
After defining the GCN layer, we can construct a deeper GCN model with two GCN layers.
```python
output = gcn_layer(gw, gw.node_feat['feature'], gw.edge_feat['edge_feature'],
                hidden_size=8, name='gcn_layer_1', activation='relu')
output = gcn_layer(gw, output, gw.edge_feat['edge_feature'],
                hidden_size=1, name='gcn_layer_2', activation=None)
```

## Step 3:  data preprocessing
Since we are building a binary node classifier, we can use 0 and 1 to represent the two classes.
```python 
y = [0,1,1,1,0,0,0,1,0,1]
label = np.array(y, dtype="float32")
label = np.expand_dims(label, -1)
```
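
The extra dimension matters because the second GCN layer has `hidden_size=1`, so it emits one logit per node, and the loss used in Step 4 expects labels of the same shape as the logits. A small NumPy-only check of the label array, using the same values as above:

```python
import numpy as np

y = [0, 1, 1, 1, 0, 0, 0, 1, 0, 1]
label = np.expand_dims(np.array(y, dtype="float32"), -1)

num_pos = int(label.sum())        # nodes labelled 1
num_neg = int((1 - label).sum())  # nodes labelled 0

print("label shape:", label.shape)
print("class 1 nodes:", num_pos, "| class 0 nodes:", num_neg)
```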

## Step 4:  training program
The training process of GCN is the same as that of other paddle-based models.

- First, we create a loss function.
- Then, we create an optimizer.
- Finally, we create an executor and train the model.

```python
# create a label layer as a container 
node_label = fluid.layers.data("node_label", shape=[None, 1],
            dtype="float32", append_batch_size=False)

# using cross-entropy with sigmoid layer as the loss function
loss = fluid.layers.sigmoid_cross_entropy_with_logits(x=output, label=node_label)

# calculate the mean loss
loss = fluid.layers.mean(loss)

# choose the Adam optimizer and set the learning rate to be 0.01
adam = fluid.optimizer.Adam(learning_rate=0.01)
adam.minimize(loss)

# create the executor 
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())
feed_dict = gw.to_feed(g) # gets graph data

for epoch in range(30):
    feed_dict['node_label'] = label
    
    train_loss = exe.run(fluid.default_main_program(),
        feed=feed_dict,
        fetch_list=[loss],
        return_numpy=True)
    print('Epoch %d | Loss: %f'%(epoch, train_loss[0]))
```
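
For readers who want to see what the loss computes, the following NumPy sketch evaluates the sigmoid cross-entropy on logits and turns logits into class predictions. It is an illustration only: the function below is a hand-written re-implementation of the standard formula, and the logits and labels are made-up values, not results from the training run above.

```python
import numpy as np

def sigmoid_cross_entropy_with_logits(x, z):
    # numerically stable form of
    # -z * log(sigmoid(x)) - (1 - z) * log(1 - sigmoid(x))
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-np.abs(x)))

# made-up logits and labels for three nodes
logits = np.array([[2.0], [-1.0], [0.0]], dtype="float32")
labels = np.array([[1.0], [0.0], [1.0]], dtype="float32")

loss = sigmoid_cross_entropy_with_logits(logits, labels).mean()

# a node is assigned class 1 when its sigmoid probability exceeds 0.5
prob = 1.0 / (1.0 + np.exp(-logits))
pred = (prob > 0.5).astype("int64")

print("mean loss: %f" % loss)
print("predictions:", pred.ravel().tolist())
```

In the tutorial's program, the same thresholding could be applied to the fetched values of `output` once training has finished.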