In this post, we will build a graph neural network using torch_geometric (PyTorch Geometric), a geometric deep learning extension library for PyTorch that provides many variations of deep learning on graphs and other irregular structures. Our main building block is the GCNConv layer, which implements the paper "Semi-supervised Classification with Graph Convolutional Networks", if you would like to have a look at how it was developed.

The dataset comes in three files:

musae_git_edges.csv, which contains the edges' indices.
musae_git_features.json, which contains the nodes' features.
musae_git_target.csv, which contains the targets, i.e. the node labels.

To install torch_geometric, please follow the installation guide here strictly. There are different options if you want to install using pip wheels; check your syntax carefully to download the compatible version for your machine and software.
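For instance, on a CPU-only machine the install could look like the lines below; the wheel index URL and version numbers here are illustrative assumptions, so substitute whatever the official guide generates for your setup.

# illustrative only: match the wheel index to your torch/CUDA versions
pip install torch-scatter torch-sparse -f https://data.pyg.org/whl/torch-1.13.0+cpu.html
pip install torch-geometric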
We also need the NetworkX Python package, which we will use to visualize the graph; I installed it with pip. From torch_geometric we need to import AddTrainValTestMask, which will help us segregate the training, validation and test sets later. We also need json, pandas, numpy and some other packages.

%matplotlib inline
import json
import collections
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.transforms import AddTrainValTestMask as masking
from torch_geometric.utils.convert import to_networkx
from torch_geometric.nn import GCNConv
import networkx as nx
Assuming the three files are saved in a folder called data, we read them and print the top 5 rows and the last 5 rows of the labels file. Even though we see 4 columns, only 2 of them concern us here: the id of the node (i.e. the user) and ml_target, which is 1 if the user is a machine learning community user and 0 otherwise. Now we are sure that our task is a binary classification problem, as we only have 2 classes.

with open("data/musae_git_features.json") as json_data:
    data_raw = json.load(json_data)
edges=pd.read_csv("data/musae_git_edges.csv")
target_df=pd.read_csv("data/musae_git_target.csv")
print("5 top nodes labels")
print(target_df.head(5).to_markdown())
print()
print("5 last nodes")
print(target_df.tail(5).to_markdown())
plt.hist(target_df.ml_target,bins=4);
plt.title("Classes distribution")
plt.show()
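If you prefer exact numbers over reading them off the histogram, a quick extra line (not in the original code) counts the classes directly:

# 1 = ML community user, 0 = otherwise
print(target_df.ml_target.value_counts())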
Next, we explore the features. Each node comes with a variable-length list of feature indices, so we first flatten the raw features dictionary into feat_counts (the number of features per node) and feats (all feature indices pooled together), then plot both distributions.

feat_counts=[len(v) for v in data_raw.values()] # number of features per node
feats=[f for v in data_raw.values() for f in v] # all feature indices, flattened
plt.hist(feat_counts,bins=20)
plt.title("Number of features per node distribution")
plt.show()
plt.hist(feats,bins=50)
plt.title("Features distribution")
plt.show()
The features are lists of variable length, so we one-hot encode them inside a function called encode_data. Our plan is to use this function to encode a light subset of the graph (e.g. only 60 nodes) for the purpose of visualization. Here is the function:

def encode_data(light=False,n=60):
    if light:
        nodes_included=n
    else:
        nodes_included=len(data_raw)
    data_encoded={}
    for i in range(nodes_included):
        # one-hot encode the feature indices of node i
        one_hot_feat=np.array([0]*(max(feats)+1))
        this_feat=data_raw[str(i)]
        one_hot_feat[this_feat]=1
        data_encoded[str(i)]=list(one_hot_feat)
    if light:
        # stack the one-hot vectors into a matrix, for visualization only
        sparse_feat_matrix=np.zeros((1,max(feats)+1))
        for j in range(nodes_included):
            temp=np.array(data_encoded[str(j)]).reshape(1,-1)
            sparse_feat_matrix=np.concatenate((sparse_feat_matrix,temp),axis=0)
        sparse_feat_matrix=sparse_feat_matrix[1:,:]
        return(data_encoded,sparse_feat_matrix)
    return(data_encoded,None)
data_encoded_vis,sparse_feat_matrix_vis=encode_data(light=True,n=60)
plt.figure(figsize=(25,25));
plt.imshow(sparse_feat_matrix_vis[:,:250],cmap='Greys');
To build the graph we use torch_geometric.data.Data, which is a plain old Python object modeling a single graph with various (optional) attributes. We will construct our graph object using this class, passing the following attributes (noting that all arguments are torch tensors):

x: will be assigned the encoded node features; its shape is [number_of_nodes, number_of_features].

y: will be assigned the node labels; its shape is [number_of_nodes].

edge_index: to represent an undirected graph, we need to extend the original edge indices so that we have two separate directed edges connecting the same two nodes but pointing in opposite directions. For example, we need 2 edges between node 100 and node 200: one edge points from 100 to 200 and the other points from 200 to 100. This is a way to represent an undirected graph when we are given the edge indices. The tensor shape will be [2, 2*number_of_original_edges].

The Data class is very abstract, in the sense that you can add any attribute that you think describes your graph. For instance, we can add a metadata attribute g["meta_data"]="bla bla bla", which makes it flexible enough to encapsulate any information you would like. Now we will build the construct_graph function, which assembles the node features, labels and doubled edge indices into a Data object:

def construct_graph(data_encoded,light=False):
    node_features_list=list(data_encoded.values())
    node_features=torch.tensor(node_features_list)
    node_labels=torch.tensor(target_df['ml_target'].values)
    edges_list=edges.values.tolist()
    # original directed edges
    edge_index01=torch.tensor(edges_list, dtype = torch.long).T
    # reversed copies of the edges, so the graph becomes undirected
    edge_index02=torch.zeros(edge_index01.shape, dtype = torch.long)
    edge_index02[0,:]=edge_index01[1,:]
    edge_index02[1,:]=edge_index01[0,:]
    edge_index0=torch.cat((edge_index01,edge_index02),axis=1)
    g = Data(x=node_features, y=node_labels, edge_index=edge_index0)
    # a truncated subgraph, for visualization only
    g_light = Data(x=node_features[:,0:2],
                   y=node_labels,
                   edge_index=edge_index0[:,:55])
    if light:
        return(g_light)
    else:
        return(g)
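As a quick sanity check of the edge doubling (my addition, applying the same logic to a toy edge), we can verify that PyTorch Geometric now sees the graph as undirected:

# one original edge (0 -> 1) plus its reversed copy (1 -> 0)
ei = torch.tensor([[0], [1]], dtype=torch.long)
ei_both = torch.cat((ei, ei.flip(0)), axis=1)
print(ei_both) # tensor([[0, 1], [1, 0]])
print(Data(edge_index=ei_both, num_nodes=2).is_undirected()) # True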
To visualize the sample graph, we build a draw_graph function. We need to convert our homogeneous graph to a NetworkX graph and then plot it using networkx.draw.

def draw_graph(data0):
    if data0.num_nodes>100:
        print("This is a big graph, can not plot...")
        return
    else:
        data_nx = to_networkx(data0)
        node_colors=data0.y[list(data_nx.nodes)]
        pos = nx.spring_layout(data_nx, scale=1)
        plt.figure(figsize=(12,8))
        nx.draw(data_nx, pos, cmap=plt.get_cmap('Set1'),
                node_color=node_colors, node_size=600, connectionstyle="angle3",
                width=1, with_labels=False, edge_color='k', arrowstyle="-")
Now we call construct_graph with light=True to get the small sample graph. Then we pass it to draw_graph to show the following plot. You can see how the nodes are connected by edges and labeled by color.

g_sample=construct_graph(data_encoded=data_encoded_vis,light=True)
draw_graph(g_sample)
Next, we encode the full dataset by calling encode_data with light=False and construct the full graph by calling construct_graph with light=False. We will not try to visualize this big graph, as I am assuming that you are using your local computer with limited resources.

data_encoded,_=encode_data(light=False)
g=construct_graph(data_encoded=data_encoded,light=False)
The torch_geometric.transforms.AddTrainValTestMask class can take our graph, let us set how we want our masks to be formed, and add a node-level split via the train_mask, val_mask and test_mask attributes. In our training, we use 30% as a validation set and 60% as a test set, while we keep only 10% for training. You may prefer different split ratios, but this way we get a more realistic performance estimate and will not overfit easily (I know you might disagree with me on this point)! We can also print the graph information and the number of nodes in each set (mask). The numbers between the brackets are the shapes of the attribute tensors.

msk=masking(split="train_rest", num_splits = 1, num_val = 0.3, num_test= 0.6)
g=msk(g)
print(g)
print()
print("training samples",torch.sum(g.train_mask).item())
print("validation samples",torch.sum(g.val_mask ).item())
print("test samples",torch.sum(g.test_mask ).item())
For the model, we use the torch_geometric.nn.GCNConv class; however, there are many other layers you can try from the PyTorch Geometric documentation. We stack two GCNConv layers: the first has input features equal to the number of features in our graph and some arbitrary number of output features f. We then apply a relu activation function and deliver the latent features to the second layer, which has a number of output nodes equal to the number of our classes (i.e. 2). In the forward function, GCNConv can accept several arguments: x as the node features, edge_index and edge_weight; in our case we only use the first two.

class SocialGNN(torch.nn.Module):
    def __init__(self,num_of_feat,f):
        super(SocialGNN, self).__init__()
        self.conv1 = GCNConv(num_of_feat, f)
        self.conv2 = GCNConv(f, 2)

    def forward(self, data):
        x = data.x.float()
        edge_index = data.edge_index
        x = self.conv1(x=x, edge_index=edge_index)
        x = F.relu(x)
        x = self.conv2(x, edge_index)
        return x
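Before training, a small smoke test (my addition) confirms that the forward pass returns one 2-dimensional logit vector per node:

net_check=SocialGNN(num_of_feat=g.num_node_features,f=16)
print(net_check(g).shape) # torch.Size([number_of_nodes, 2])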
To respect the masks during training and evaluation, we define two helper functions, masked_loss and masked_accuracy, to which we pass the respective masks; each returns the corresponding loss or accuracy. The idea is to calculate the loss and accuracy for all nodes and multiply by the mask to zero out the nodes that are not required.

def masked_loss(predictions,labels,mask):
    mask=mask.float()
    # normalize so the mean over all nodes equals the mean over masked nodes
    mask=mask/torch.mean(mask)
    loss=criterion(predictions,labels)
    loss=loss*mask
    loss=torch.mean(loss)
    return loss

def masked_accuracy(predictions,labels,mask):
    mask=mask.float()
    mask/=torch.mean(mask)
    accuracy=(torch.argmax(predictions,axis=1)==labels).long()
    accuracy=mask*accuracy
    accuracy=torch.mean(accuracy)
    return accuracy
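To see why dividing the mask by its mean works, take a toy example: with 2 of 4 nodes masked, torch.mean(mask) is 0.5, so the surviving entries are scaled by 2 and the overall mean equals the mean over the masked nodes alone. The tensors below are made up purely for illustration:

toy_loss=torch.tensor([1.0, 2.0, 3.0, 4.0])
toy_mask=torch.tensor([1.0, 0.0, 1.0, 0.0])
print(torch.mean(toy_loss*toy_mask/torch.mean(toy_mask))) # tensor(2.)
print(toy_loss[toy_mask.bool()].mean()) # tensor(2.)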
Finally, we train the network with the torch.optim.Adam optimizer. We run the training for some number of epochs, keeping track of the best validation accuracy. We also plot the losses and accuracies across the epochs.

def train_social(net,data,epochs=10,lr=0.01):
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    best_accuracy=0.0
    train_losses=[]
    train_accuracies=[]
    val_losses=[]
    val_accuracies=[]
    test_losses=[]
    test_accuracies=[]
    for ep in range(epochs+1):
        optimizer.zero_grad()
        out=net(data)
        loss=masked_loss(predictions=out,
                         labels=data.y,
                         mask=data.train_mask)
        loss.backward()
        optimizer.step()
        train_losses+=[loss.item()]
        train_accuracy=masked_accuracy(predictions=out,
                                       labels=data.y,
                                       mask=data.train_mask)
        train_accuracies+=[train_accuracy.item()]
        val_loss=masked_loss(predictions=out,
                             labels=data.y,
                             mask=data.val_mask)
        val_losses+=[val_loss.item()]
        val_accuracy=masked_accuracy(predictions=out,
                                     labels=data.y,
                                     mask=data.val_mask)
        val_accuracies+=[val_accuracy.item()]
        # record the test metrics as well, so the test curves can be plotted
        test_loss=masked_loss(predictions=out,
                              labels=data.y,
                              mask=data.test_mask)
        test_losses+=[test_loss.item()]
        test_accuracy=masked_accuracy(predictions=out,
                                      labels=data.y,
                                      mask=data.test_mask)
        test_accuracies+=[test_accuracy.item()]
        if np.round(val_accuracy.item(),4)>np.round(best_accuracy,4):
            print("Epoch {}/{}, Train_Loss: {:.4f}, Train_Accuracy: {:.4f}, Val_Accuracy: {:.4f}, Test_Accuracy: {:.4f}"
                  .format(ep+1,epochs,loss.item(),train_accuracy.item(),val_accuracy.item(),test_accuracy.item()))
            best_accuracy=val_accuracy.item()
    plt.plot(train_losses)
    plt.plot(val_losses)
    plt.plot(test_losses)
    plt.show()
    plt.plot(train_accuracies)
    plt.plot(val_accuracies)
    plt.plot(test_accuracies)
    plt.show()
We use nn.CrossEntropyLoss as our loss criterion, with reduction='none' so that masked_loss receives a per-node loss vector it can mask (with the default 'mean' reduction, the mask would have no effect). Below, we can see that our simple model reaches a very decent accuracy on the test set, more than 87%. We can also see the learning curves (losses) and the development of the accuracies through the epochs in the top and bottom plots respectively.

num_of_feat=g.num_node_features
net=SocialGNN(num_of_feat=num_of_feat,f=16)
criterion=nn.CrossEntropyLoss(reduction='none')
train_social(net,g,epochs=50,lr=0.1)
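As a final step (an addition to the original code), we can switch the network to evaluation mode and recompute the test accuracy once more:

net.eval()
with torch.no_grad():
    out=net(g)
    print("final test accuracy:",masked_accuracy(out,g.y,g.test_mask).item())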
In this post, we built and trained a graph neural network for node classification within the PyTorch framework using the torch_geometric extension. You might want to access the GitHub repository of this code. This was just a small taste of the many graph deep learning tools torch_geometric offers.