Coding: Graph Neural Networks Implementation

Author

Sadamori Kojaku

Published

July 27, 2025

1 Preliminaries: Image Processing

Graph Neural Networks (GNNs) are a type of neural network for graph data. node2vec and DeepWalk stem from the idea of language modeling. In this module, we focus on another branch of graph neural networks that stems from image processing.

Edge Detection Problem in Image Processing

Edge detection is a classical problem in image processing. The goal is to identify the boundaries of objects in an image.

To approach the problem, let us first recall that an image is a matrix of pixels. Each pixel has RGB values representing the intensities of red, green, and blue. To simplify the problem, we focus on grayscale images, in which each pixel has a single value representing its brightness. In this case, an image can be represented as a 2D matrix, where each element represents the brightness of a pixel.

An example

Human eyes are very sensitive to brightness changes. An edge in an image appears where there is a significant brightness change between adjacent pixels. To be concrete, let's consider a small example of 6x6 pixels with a vertical line running from top to bottom, whose brightness is higher than that of the neighboring pixels. This is the edge we want to detect.

X = \begin{bmatrix} 10 & 10 & 80 & 10 & 10 & 10 \\ 10 & 10 & 80 & 10 & 10 & 10 \\ 10 & 10 & 80 & 10 & 10 & 10 \\ 10 & 10 & 80 & 10 & 10 & 10 \\ 10 & 10 & 80 & 10 & 10 & 10 \\ 10 & 10 & 80 & 10 & 10 & 10 \end{bmatrix}

Let's zoom in on the pixel at (3, 3) and its surrounding pixels.

Z = \begin{bmatrix} 10 & 80 & 10 \\ \textcolor{blue}{10} & \textcolor{red}{80} & \textcolor{purple}{10} \\ 10 & 80 & 10 \end{bmatrix}

where the central pixel is highlighted in red. Since we are interested in the edge, i.e., a sudden change in brightness along the horizontal direction, we take a derivative at the central pixel:

\nabla Z_{22} = \textcolor{blue}{Z_{2,1}} - \textcolor{purple}{Z_{2,3}}

Following the same process, we can compute the derivative at all pixels, which gives us the (horizontal) derivative of the image.

\begin{bmatrix} - & -70 & 0 & 70 & 0 & - \\ - & -70 & 0 & 70 & 0 & - \\ - & -70 & 0 & 70 & 0 & - \\ - & -70 & 0 & 70 & 0 & - \\ - & -70 & 0 & 70 & 0 & - \\ - & -70 & 0 & 70 & 0 & - \end{bmatrix}

The symbol - indicates that the derivative is not defined because one of the neighboring pixels lies outside the image boundary. We observe that the derivative is large in magnitude around the edge and zero elsewhere. This is a simple but effective way to detect edges in an image.
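To check this numerically, here is a minimal NumPy sketch (not part of the original text) that computes the horizontal derivative of the example image using the left-minus-right convention defined above:

import numpy as np

# The 6x6 example image with a bright vertical line in the third column
X = np.full((6, 6), 10)
X[:, 2] = 80

# Horizontal derivative: left neighbor minus right neighbor.
# It is undefined at the first and last columns, so the result is 6x4.
dX_horizontal = X[:, :-2] - X[:, 2:]
print(dX_horizontal)  # each row reads [-70, 0, 70, 0], matching the matrix above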

We can consider a derivative operator along the vertical direction that computes the difference between the vertical neighboring pixels.

\nabla Z_{22} = Z_{1,2} - Z_{3,2}

And, when applied to the entire image, the result is

\begin{bmatrix} - & - & - & - & - & - \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ - & - & - & - & - & - \end{bmatrix}

All defined entries are zero, meaning that there is no edge along the vertical direction.

We can combine the horizontal and vertical derivatives to get the gradient of the image. For example,

\nabla Z_{22} = Z_{12} - Z_{32} + Z_{21} - Z_{23}

When applied to the entire image (wherever both derivatives are defined), the result is the same as the horizontal derivative, since the vertical derivative is zero everywhere.

Convolution

We observe a repeated pattern in the derivative computations: we take additions and subtractions of neighboring pixels. This motivates a more general form of the operation.

\nabla Z_{22} = \sum_{i=-1}^1 \sum_{j=-1}^1 K_{h-(i+1),w-(j+1)} Z_{2+i, 2+j}

where K is a 3 \times 3 matrix, and w=h=3 are the width and height of the kernel. The horizontal and vertical derivatives above correspond to the kernels

K_{\text{horizontal}} = \begin{bmatrix} 0 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix},\quad K_{\text{vertical}} = \begin{bmatrix} 0 & -1 & 0 \\ 0 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}

The operation of K on the image is called convolution, and K is called the kernel or filter. More generally, the convolution of a kernel K and an image X is defined as

Y_{ij} = \sum_{p}\sum_{q} K_{pq} X_{i-p+\frac{h+1}{2}, j-q+\frac{w+1}{2}}

where h and w are the height and width of the kernel, respectively.
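As a sanity check, here is a minimal sketch (not from the original text) that applies the two kernels with scipy.signal.convolve2d, which implements the same flipped-kernel convolution as the formula above:

import numpy as np
from scipy.signal import convolve2d

# Example image: bright vertical line in the third column
X = np.full((6, 6), 10)
X[:, 2] = 80

# Kernels from the text
K_horizontal = np.array([[0, 0, 0], [-1, 0, 1], [0, 0, 0]])
K_vertical = np.array([[0, -1, 0], [0, 0, 0], [0, 1, 0]])

# mode="valid" keeps only pixels whose 3x3 neighborhood lies fully inside the image
print(convolve2d(X, K_horizontal, mode="valid"))  # rows of [-70, 0, 70, 0]
print(convolve2d(X, K_vertical, mode="valid"))    # all zeros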

2 From Image to Graph

Analogy between image and graph data

We can think of a convolution of an image from the perspective of networks. In the convolution of an image, a pixel is convolved with its neighbors. We can regard each pixel as a node, and each node is connected to its neighboring nodes (pixels) that are involved in the convolution.

Building on this analogy, we can extend the idea of convolution to general graph data. Each node carries a value (or a feature vector), which is convolved with the values of its neighbors in the graph. This is the key idea of graph convolutional networks. There is a key difference, however: while every pixel has the same number of neighbors (except at the image boundary), nodes in a graph can have very different numbers of neighbors. This heterogeneity makes it non-trivial to define the "kernel" for graph convolution.

Spectral filter on graphs

Just like we can define a convolution on images in the frequency domain, we can also define a "frequency domain" for graphs.

Consider a network of N nodes, where each node i has a scalar feature x_i \in \mathbb{R}. We are interested in the quantity

J = \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N A_{ij}(x_i - x_j)^2,

where A_{ij} is the adjacency matrix of the graph. The quantity J represents the total variation of x between connected nodes; a small J means that connected nodes have similar x (low variation; low frequency), while a large J means that connected nodes have very different x (high variation; high frequency).

We can rewrite J as

J = \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N A_{ij}(x_i - x_j)^2 = {\bf x}^\top {\bf L} {\bf x},

where {\bf L} is the Laplacian matrix of the graph given by

L_{ij} = \begin{cases} -1 & \text{if } i \text{ and } j \text{ are connected} \\ k_i & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}.

and {\bf x} = [x_1,x_2,\ldots, x_N]^\top is a column vector of feature variables.

Detailed derivation


The derivation below shows that the total variation of x between connected nodes is given by {\bf x}^\top {\bf L} {\bf x}.

\begin{aligned} J &= \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N A_{ij}(x_i - x_j)^2 \\ &= \frac{1}{2}\sum_{i=1}^N\sum_{j=1}^N \underbrace{A_{ij}\left( x_i^2 +x_j^2\right)}_{\text{symmetric}} - \sum_{i=1}^N\sum_{j=1}^N A_{ij}x_ix_j \\ &= \sum_{i=1}^N x_i^2\underbrace{\sum_{j=1}^N A_{ij}}_{\text{degree of node } i,\ k_i} - \sum_{i=1}^N\sum_{j=1}^N A_{ij}x_ix_j \\ &= \sum_{i=1}^N x_i^2 k_i - \sum_{i=1}^N\sum_{j=1}^N A_{ij}x_ix_j \\ &= \underbrace{[x_1,x_2,\ldots, x_N]}_{{\bf x}^\top} \underbrace{\begin{bmatrix} k_1 & 0 & \cdots & 0 \\ 0 & k_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & k_N \end{bmatrix}}_{{\bf D}} \underbrace{\begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{bmatrix}}_{{\bf x}} - \underbrace{\sum_{i=1}^N\sum_{j=1}^N A_{ij}x_ix_j}_{{\bf x}^\top {\mathbf A} {\bf x}} \\ &= {\bf x}^\top {\bf D} {\bf x} - {\bf x}^\top {\mathbf A} {\bf x} \\ &= {\bf x}^\top {\bf L} {\bf x}, \end{aligned}

Let us showcase the analogy between the Fourier transform and the Laplacian matrix. In the Fourier transform, a signal is decomposed into sinusoidal basis functions. Similarly, for a graph, we can decompose the variation J into eigenvector bases.

J = \sum_{i=1}^N \lambda_i {\bf x}^\top {\mathbf u}_i {\mathbf u}_i^\top {\bf x} = \sum_{i=1}^N \lambda_i ||{\bf x}^\top {\mathbf u}_i||^2.

where {\mathbf u}_i is the eigenvector corresponding to the eigenvalue \lambda_i.

  • The term {\bf x}^\top {\mathbf u}_i is the dot product between the feature vector {\bf x} and the eigenvector {\mathbf u}_i, which measures how much {\bf x} coheres with {\mathbf u}_i, similar to how Fourier coefficients measure coherence with sinusoids.
  • Each ||{\bf x}^\top {\mathbf u}_i||^2 is the "strength" of {\bf x} with respect to the eigenvector {\mathbf u}_i, and the total variation J is a weighted sum of these strengths.

Some eigenvectors correspond to low-frequency components, while others correspond to high-frequency components. For example, the total variation J for an eigenvector {\mathbf u}_i is given by

J = \frac{1}{2} \sum_{j}\sum_{\ell} A_{j\ell}(u_{ij} - u_{i\ell})^2 = {\mathbf u}_i^\top {\mathbf L} {\mathbf u}_i = \lambda_i.

This equation provides key insight into the meaning of eigenvalues:

  1. For an eigenvector {\mathbf u}_i, its eigenvalue \lambda_i measures the total variation for {\mathbf u}_i.
  2. Large eigenvalues mean large differences between neighbors (high frequency), while small eigenvalues mean small differences (low frequency).

Thus, if {\bf x} aligns well with a {\mathbf u}_i with a large \lambda_i, then {\bf x} has a strong high-frequency component; if {\bf x} aligns well with a {\mathbf u}_i with a small \lambda_i, then {\bf x} has a strong low-frequency component.
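The identity J = {\bf x}^\top {\bf L} {\bf x} = \sum_i \lambda_i ({\mathbf u}_i^\top {\bf x})^2 is easy to verify numerically. Below is a minimal sketch on a small random graph (the graph and feature vector are arbitrary choices for illustration):

import numpy as np

rng = np.random.default_rng(0)

# A small random undirected graph (adjacency matrix)
N = 10
A = (rng.random((N, N)) < 0.3).astype(float)
A = np.triu(A, 1)
A = A + A.T

# Laplacian and a random feature vector
D = np.diag(A.sum(axis=1))
L = D - A
x = rng.normal(size=N)

# Total variation computed three ways
J_sum = 0.5 * np.sum(A * (x[:, None] - x[None, :]) ** 2)
J_quad = x @ L @ x
evals, evecs = np.linalg.eigh(L)
J_spec = np.sum(evals * (evecs.T @ x) ** 2)

print(J_sum, J_quad, J_spec)  # all three agree up to floating-point error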

Spectral Filtering

Eigenvalues \lambda_i can be thought of as a filter that controls which frequency components pass through. Instead of using the filter associated with the Laplacian matrix, we can design a filter h(\lambda_i) to control which frequency components pass through. This leads to the idea of spectral filtering. Two common filters are:

  1. Low-pass Filter: h_{\text{low}}(\lambda) = \frac{1}{1 + \alpha\lambda}
    • Preserves low frequencies (small λ)
    • Suppresses high frequencies (large λ)
    • Results in smoother signals
  2. High-pass Filter: h_{\text{high}}(\lambda) = \frac{\alpha\lambda}{1 + \alpha\lambda}
    • Preserves high frequencies
    • Suppresses low frequencies
    • Emphasizes differences between neighbors

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("talk")

alpha = 1
lambdas = np.linspace(0, 10, 100)
h_low = 1 / (1 + alpha * lambdas)
h_high = (alpha * lambdas) / (1 + alpha * lambdas)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
sns.lineplot(x=lambdas, y=h_low, label="Low-pass filter", ax=axes[0])
axes[0].legend(frameon=False).remove()
sns.lineplot(x=lambdas, y=h_high, label="High-pass filter", ax=axes[1])
axes[1].legend(frameon=False).remove()
axes[0].set_title("Low-pass filter")
axes[1].set_title("High-pass filter")
fig.text(0.5, 0.01, r"Eigenvalue $\lambda$", ha="center")
axes[0].set_ylabel(r"Filter response $h(\lambda)$")
sns.despine()
plt.tight_layout()

Example

Let us showcase the idea of spectral filtering with a simple example with the karate club network.

import igraph as ig
import numpy as np
from scipy import sparse
import matplotlib as mpl

G = ig.Graph.Famous("Zachary")
A = G.get_adjacency_sparse()

We will first compute the Laplacian matrix and its eigendecomposition.

# Compute Laplacian matrix
deg = np.array(A.sum(axis=1)).reshape(-1)
D = sparse.diags(deg)
L = D - A

# Compute eigendecomposition
evals, evecs = np.linalg.eigh(L.toarray())

# Sort eigenvalues and eigenvectors
order = np.argsort(evals)
evals = evals[order]
evecs = evecs[:, order]

Now, let’s create a low-pass and high-pass filter.

alpha = 2
L_low = evecs @ np.diag(1 / (1 + alpha * evals)) @ evecs.T
L_high = evecs @ np.diag(alpha * evals / (1 + alpha * evals)) @ evecs.T

print("Size of low-pass filter:", L_low.shape)
print("Size of high-pass filter:", L_high.shape)

Notice that the low-pass and high-pass filters are matrices of the same size as the adjacency matrix A. Each defines a 'convolution' on the graph as follows:

{\bf x}' = {\bf L}_{\text{low}} {\bf x} \quad \text{or} \quad {\bf x}' = {\bf L}_{\text{high}} {\bf x}.

where {\bf L}_{\text{low}} and {\bf L}_{\text{high}} are the low-pass and high-pass filters, respectively, and {\bf x}' is the convolved feature vector.

Now, let’s see how these filters work. Our first example is a random feature vector.

# Random feature vector
x = np.random.randn(A.shape[0], 1)

# Convolve with low-pass filter
x_low = L_low @ x

# Convolve with high-pass filter
x_high = L_high @ x

Let us visualize the results.


fig, axes = plt.subplots(1, 3, figsize=(15, 5))
palette = sns.color_palette("viridis", as_cmap=True)
norm = mpl.colors.Normalize(vmin=-0.3, vmax=0.3)

# Original
values = x.reshape(-1)
values /= np.linalg.norm(values)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[0])
axes[0].set_title("Original")

# Low-pass filter applied
values = L_low @ x
values /= np.linalg.norm(values)
values = values.reshape(-1)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[1])
axes[1].set_title("Low-pass filter")

# High-pass filter applied
values = L_high @ x
values /= np.linalg.norm(values)
values = values.reshape(-1)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[2])
axes[2].set_title("High-pass filter")
fig.tight_layout()

We observe that the low-pass filter yields a smoother {\bf x}'_{\text{low}}, i.e., neighboring nodes have similar values. In contrast, {\bf x}'_{\text{high}} looks very much like the original {\bf x}: a random vector is dominated by high-frequency components, which the high-pass filter preserves and the low-pass filter suppresses.

Let's now use the eigenvector centrality as our feature vector {\bf x}.

eigen_centrality = np.array(G.eigenvector_centrality()).reshape(-1, 1)
low_pass_eigen = L_low @ eigen_centrality
high_pass_eigen = L_high @ eigen_centrality

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
palette = sns.color_palette("viridis", as_cmap=True)

norm = mpl.colors.Normalize(vmin=0, vmax=0.3)
values = eigen_centrality.reshape(-1)
values /= np.linalg.norm(values)
values = values.reshape(-1)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[0])
axes[0].set_title("Original")

values = low_pass_eigen.reshape(-1)
values /= np.linalg.norm(values)
values = values.reshape(-1)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[1])
axes[1].set_title("Low-pass filter")

values = high_pass_eigen.reshape(-1)
values /= np.linalg.norm(values)
ig.plot(G, vertex_color=[palette(norm(x)) for x in values], bbox=(0, 0, 500, 500), vertex_size=20, target=axes[2])
axes[2].set_title("High-pass filter")
fig.tight_layout()

The high-pass filter increases the contrast of the eigenvector centrality, emphasizing the differences between nodes. On the other hand, the low-pass filter smooths out the eigenvector centrality.

3 Graph Convolutional Networks

We have seen that spectral filters give us a principled way to think about “convolution” on irregular graph structures, and controlling the frequency components brings out different aspects of the data. We now go one step further: instead of designing filters by hand, we can learn them from data for specific tasks.

Spectral Graph Convolutional Networks

The simplest form of a learnable spectral filter is given by

{\bf L}_{\text{learn}} = \sum_{k=1}^K \theta_k {\mathbf u}_k {\mathbf u}_k^\top,

where {\mathbf u}_k are the eigenvectors and \theta_k are the learnable parameters. The variable K is the number of eigenvectors used (i.e., the rank of the filter). The weight \theta_k is learned to maximize the performance of the task at hand.

Building on this idea, {footcite}bruna2014spectral added a nonlinearity to the filter and proposed a spectral graph convolutional network given by

{\bf x}^{(\ell+1)} = h\left( L_{\text{learn}} {\bf x}^{(\ell)}\right),

where h is an activation function, and {\bf x}^{(\ell)} is the feature vector after the \ell-th convolution. They further extend this idea to convolve multidimensional feature vectors {\bf X} \in \mathbb{R}^{N \times f_{\text{in}}} and produce new feature vectors of a different dimensionality, {\bf X}' \in \mathbb{R}^{N \times f_{\text{out}}}.

\begin{aligned} {\bf X}^{(\ell+1)}_i &= h\left( \sum_j L_{\text{learn}}^{(i,j)} {\bf X}^{(\ell)}_j\right),\quad \text{where} \quad L^{(i,j)}_{\text{learn}} = \sum_{k=1}^K \theta_{k, (i,j)} {\mathbf u}_k {\mathbf u}_k^\top, \end{aligned}

Notice that a learnable filter L_{\text{learn}}^{(i,j)} is defined for each pair of output dimension i and input dimension j.

Despite the complicated formulas, many GCNs are simple when it comes to implementation, and implementing them is one of my favorite ways to learn GNNs. Check out the [Appendix for the Python implementation](appendix.md).
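For concreteness, here is a minimal sketch (an illustration, not the authors' code) of a single spectral convolution layer with the learnable filter {\bf L}_{\text{learn}} = \sum_k \theta_k {\mathbf u}_k {\mathbf u}_k^\top, written in PyTorch; it assumes the Laplacian eigenvectors (evecs) have been precomputed as in the earlier example.

import torch
import torch.nn as nn

class SpectralConv(nn.Module):
    """One spectral convolution layer: x' = h( (sum_k theta_k u_k u_k^T) x )."""

    def __init__(self, evecs, K):
        super().__init__()
        # Use the K eigenvectors with the smallest eigenvalues as the basis
        self.U = torch.tensor(evecs[:, :K], dtype=torch.float32)  # (N, K)
        self.theta = nn.Parameter(torch.randn(K))                 # learnable filter coefficients

    def forward(self, x):
        # Project x onto the eigenbasis, reweight by theta, and project back
        x_hat = self.U.T @ x                        # (K,) spectral coefficients
        x_filtered = self.U @ (self.theta * x_hat)  # back to the node space
        return torch.relu(x_filtered)

# Usage sketch with the eigenvectors computed earlier for the karate club graph:
# layer = SpectralConv(evecs, K=5)
# x_out = layer(torch.tensor(x.ravel(), dtype=torch.float32))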

From Spectral to Spatial

Spectral GCNs are mathematically elegant but have two main limitations:

  1. Computational Limitation: Computing the spectrum of the Laplacian is expensive ({\cal O}(N^3)) and prohibitive for large graphs.
  2. Spatial Locality: The learned filters are not spatially localized; a node can be influenced by all other nodes in the graph.

These two limitations motivate the development of spatial GCNs.

ChebNet

ChebNet {footcite}defferrard2016convolutional is one of the earliest spatial GCNs that bridges the gap between spectral and spatial domains. The key idea is to leverage Chebyshev polynomials to approximate {\bf L}_{\text{learn}} by

{\bf L}_{\text{learn}} \approx \sum_{k=0}^{K-1} \theta_k T_k(\tilde{{\bf L}}), \quad \text{where} \quad \tilde{{\bf L}} = \frac{2}{\lambda_{\text{max}}}{\bf L} - {\bf I},

where \tilde{{\bf L}} is the Laplacian matrix, scaled and shifted so that its eigenvalues lie in the range [-1,1]. The Chebyshev polynomials T_k(\tilde{{\bf L}}) are defined recursively by:

\begin{aligned} T_0(\tilde{{\bf L}}) &= {\bf I} \\ T_1(\tilde{{\bf L}}) &= \tilde{{\bf L}} \\ T_k(\tilde{{\bf L}}) &= 2\tilde{{\bf L}} T_{k-1}(\tilde{{\bf L}}) - T_{k-2}(\tilde{{\bf L}}) \end{aligned}

We then replace {\bf L}_{\text{learn}} in the original spectral GCN with the Chebyshev polynomial approximation:

{\bf x}^{(\ell+1)} = h\left( \sum_{k=0}^{K-1} \theta_k T_k(\tilde{{\bf L}}){\bf x}^{(\ell)}\right),

where:

  • T_k(\tilde{{\bf L}}) applies the k-th Chebyshev polynomial to the scaled Laplacian matrix
  • \theta_k are the learnable parameters
  • K is the order of the polynomial (typically small, e.g., K=3)
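Below is a minimal sketch of the Chebyshev filtering step (an illustration, not the reference implementation). Thanks to the recursion, it only needs the largest eigenvalue of {\bf L} rather than the full eigendecomposition:

import numpy as np
from scipy import sparse
from scipy.sparse.linalg import eigsh

def cheb_filter(A, x, theta):
    """Apply the Chebyshev-approximated filter sum_k theta_k T_k(L_tilde) x.

    Assumes len(theta) >= 2. A is a (sparse) adjacency matrix, x a node feature vector.
    """
    deg = np.asarray(A.sum(axis=1)).ravel()
    L = (sparse.diags(deg) - A).astype(float)

    # Scale and shift the Laplacian so that its eigenvalues lie in [-1, 1]
    lam_max = eigsh(L, k=1, which="LA", return_eigenvectors=False)[0]
    L_tilde = (2.0 / lam_max) * L - sparse.identity(A.shape[0])

    # Chebyshev recursion: T_0 x = x, T_1 x = L_tilde x,
    # T_k x = 2 L_tilde (T_{k-1} x) - T_{k-2} x
    Tx_prev, Tx = x, L_tilde @ x
    out = theta[0] * Tx_prev + theta[1] * Tx
    for k in range(2, len(theta)):
        Tx_prev, Tx = Tx, 2 * (L_tilde @ Tx) - Tx_prev
        out = out + theta[k] * Tx
    return out

# Example: a K=3 filter on the karate club graph with random coefficients
# y = cheb_filter(A, np.random.randn(A.shape[0]), theta=np.random.randn(3))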

Graph Convolutional Networks by Kipf and Welling

While ChebNet offers a principled way to approximate spectral convolutions, Kipf and Welling (2017) {footcite}kipf2017semi proposed an even simpler and highly effective variant called Graph Convolutional Networks (GCN).

First-order Approximation

The key departure is to use the first-order approximation of the Chebyshev polynomials (K=2) together with the approximation \lambda_{\text{max}} \approx 2. Here {\bf L} denotes the symmetric normalized Laplacian {\bf I}_N - {\bf D}^{-\frac{1}{2}}{\bf A}{\bf D}^{-\frac{1}{2}}, so that \tilde{{\bf L}} \approx {\bf L} - {\bf I}_N. This gives

g_{\theta'} * x \approx \theta'_0x + \theta'_1(L - I_N)x = \theta'_0x - \theta'_1D^{-\frac{1}{2}}AD^{-\frac{1}{2}}x

This is a crude approximation, but it leads to a much simpler form with only two learnable parameters, instead of the K parameters in the original ChebNet.

Additionally, they further simplify the formula by using a single parameter \theta for both remaining parameters (i.e., \theta'_0 = \theta and \theta'_1 = -\theta). The result is the following convolutional filter:

g_{\theta} * x \approx \theta(I_N + D^{-\frac{1}{2}}AD^{-\frac{1}{2}})x

While this is a very simple filter, one can stack multiple layers of convolutions to perform high-order graph convolutions.
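As a quick numerical illustration (a sketch, not from the original text), the operator {\bf I}_N + {\bf D}^{-\frac{1}{2}}{\bf A}{\bf D}^{-\frac{1}{2}} has eigenvalues in the range [0, 2], which is why repeatedly applying it can cause the numerical issues discussed next:

import numpy as np

# An arbitrary small random graph for illustration
rng = np.random.default_rng(0)
N = 20
A = (rng.random((N, N)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T

deg = A.sum(axis=1)
deg[deg == 0] = 1  # guard against isolated nodes (division by zero)
D_inv_sqrt = np.diag(1 / np.sqrt(deg))

# The simplified filter I + D^{-1/2} A D^{-1/2}
filt = np.eye(N) + D_inv_sqrt @ A @ D_inv_sqrt

evals = np.linalg.eigvalsh(filt)
print(evals.min(), evals.max())  # all eigenvalues fall within [0, 2]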

Deep GCNs can suffer from over-smoothing

GCN models can be deep, and when they are too deep, they can suffer from vanishing or exploding gradients, where the gradients of the loss function become too small or too large to update the model parameters. This is a common problem in deep learning.

To facilitate the training of deep GCNs, the authors introduce a very simple trick called renormalization. The idea is to add self-connections to the graph:

\tilde{A} = A + I_N, \quad \text{and} \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}

And use \tilde{A} and \tilde{D} to form the convolutional filter.

Altogether, this leads to the following layer-wise propagation rule:

X^{(\ell+1)} = \sigma(\tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}X^{(\ell)}W^{(\ell)})

where:

  • X^{(\ell)} is the matrix of node features at layer \ell
  • W^{(\ell)} is the layer's trainable weight matrix
  • \sigma is a nonlinear activation function (e.g., ReLU)
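Here is a minimal sketch of this propagation rule in PyTorch (an illustration that assumes the renormalized operator is precomputed with NumPy; it is not the reference implementation):

import numpy as np
import torch
import torch.nn as nn

def renormalized_adj(A):
    """Compute D_tilde^{-1/2} (A + I) D_tilde^{-1/2} as a dense torch tensor."""
    A_tilde = A + np.eye(A.shape[0])
    deg_tilde = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1 / np.sqrt(deg_tilde))
    return torch.tensor(D_inv_sqrt @ A_tilde @ D_inv_sqrt, dtype=torch.float32)

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)  # W^{(l)}

    def forward(self, A_hat, X):
        # X^{(l+1)} = sigma(A_hat X^{(l)} W^{(l)})
        return torch.relu(A_hat @ self.linear(X))

# Usage sketch: two layers on the karate club graph, with identity features
# A_hat = renormalized_adj(A.toarray())
# X = torch.eye(A.shape[0])
# h = GCNLayer(A.shape[0], 16)(A_hat, X)
# h = GCNLayer(16, 2)(A_hat, h)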

These simplifications offer several advantages:

  • Efficiency: Linear complexity in the number of edges
  • Localization: Each layer only aggregates information from immediate neighbors
  • Depth: Fewer parameters allow building deeper models
  • Performance: Despite (or perhaps due to) its simplicity, it often outperforms more complex models

Exercise


Let's implement a simple GCN model for node classification in the Coding Exercise.