Network Embedding Concepts
1 What is Network Embedding?

Networks are high-dimensional discrete data that can be difficult to analyze with traditional machine learning methods that assume continuous and smooth data. Network embedding is a technique to represent networks in low-dimensional continuous spaces, enabling us to apply standard machine learning algorithms.
The goal of network embedding is to map each node in a network to a point in a low-dimensional space (typically \mathbb{R}^d where d \ll N) while preserving important structural properties of the network.
2 Exercises
3 Spectral Embedding
Network Compression Approach
Let us approach spectral embedding from the perspective of network compression. Suppose we have an adjacency matrix \mathbf{A} of a network. The adjacency matrix is high-dimensional data, i.e., a matrix of size N \times N for a network of N nodes. We want to compress it into a lower-dimensional matrix \mathbf{U} of size N \times d for a user-defined small integer d < N. A good \mathbf{U} should preserve the network structure and thus can reconstruct the original data \mathbf{A} as closely as possible. This leads to the following optimization problem:
\min_{\mathbf{U}} J(\mathbf{U}),\quad J(\mathbf{U}) = \| \mathbf{A} - \mathbf{U}\mathbf{U}^\top \|_F^2
where:
- \mathbf{U}\mathbf{U}^\top is the product of \mathbf{U} with its transpose and represents the reconstructed network.
- \|\cdot\|_F is the Frobenius norm, i.e., the square root of the sum of squared entries of a matrix (so \|\cdot\|_F^2 is simply the sum of squared entries).
- J(\mathbf{U}) is the loss function that measures the difference between the original network \mathbf{A} and the reconstructed network \mathbf{U}\mathbf{U}^\top.
By minimizing this reconstruction error with respect to \mathbf{U}, we obtain the best low-dimensional embedding of the network.
Spectral Decomposition Solution
Consider the spectral decomposition of \mathbf{A}:
\mathbf{A} = \sum_{i=1}^N \lambda_i \mathbf{u}_i \mathbf{u}_i^\top
where \lambda_i are eigenvalues (weights) and \mathbf{u}_i are eigenvectors. Each term \lambda_i \mathbf{u}_i \mathbf{u}_i^\top is a rank-one matrix that captures a part of the network’s structure. The larger the weight \lambda_i, the more important that term is in describing the network.
To compress the network, we can select the d terms with the largest weights \lambda_i. By combining the corresponding \mathbf{u}_i vectors into a matrix \mathbf{U}, we obtain a good low-dimensional embedding of the network.
For a formal proof, please refer to the Appendix section.
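As a concrete illustration, here is a minimal sketch of this recipe using NumPy on a toy adjacency matrix (both the toy network and the \sqrt{\lambda_i} scaling of the eigenvectors are choices made for this example, not prescribed above):

```python
import numpy as np

# Toy adjacency matrix: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N, d = 6, 2
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1

# Spectral decomposition of the symmetric matrix A.
eigvals, eigvecs = np.linalg.eigh(A)

# Keep the d terms with the largest eigenvalues.
top = np.argsort(eigvals)[::-1][:d]
U = eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0))  # optional scaling

# Reconstruction error J(U) = ||A - U U^T||_F^2
print("J(U) =", np.linalg.norm(A - U @ U.T, "fro") ** 2)
```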
Modularity Embedding
In a similar vein, we can use the modularity matrix to generate a low-dimensional embedding of the network. Let us define the modularity matrix \mathbf{Q} as follows:
Q_{ij} = \frac{1}{2m}A_{ij} - \frac{k_i k_j}{4m^2}
where k_i is the degree of node i, and m is the number of edges in the network.
We then compute the eigenvectors of \mathbf{Q} and use them to embed the network into a low-dimensional space, just as we did for the adjacency matrix. The modularity embedding can also be used to bipartition the network into two communities with a simple rule: group nodes by the sign of their entries in the leading eigenvector of \mathbf{Q}, i.e., the one with the largest eigenvalue (Newman 2006).
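A minimal sketch of this bipartition rule on a toy two-community network (NumPy only; the toy network is an assumption of this example):

```python
import numpy as np

# Toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N = 6
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1

k = A.sum(axis=1)        # node degrees
m = A.sum() / 2          # number of edges

# Modularity matrix Q_ij = A_ij / (2m) - k_i k_j / (4 m^2)
Q = A / (2 * m) - np.outer(k, k) / (4 * m**2)

# Leading eigenvector of Q and sign-based bipartition.
eigvals, eigvecs = np.linalg.eigh(Q)
leading = eigvecs[:, np.argmax(eigvals)]
print((leading > 0).astype(int))   # e.g., [0 0 0 1 1 1] (labels may be flipped)
```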
Laplacian Eigenmap
Laplacian Eigenmap (Belkin and Niyogi 2003) is another approach to compress a network into a low-dimensional space. The fundamental idea behind this method is to position connected nodes close to each other in the low-dimensional space. This approach leads to the following optimization problem:
\min_{\mathbf{U}} J_{LE}(\mathbf{U}),\quad J_{LE}(\mathbf{U}) = \frac{1}{2}\sum_{i,j} A_{ij} \| u_i - u_j \|^2
Derivation steps:
Starting from J_{LE}(\mathbf{U}), we expand:
\begin{aligned} &= \frac{1}{2}\sum_{i,j} A_{ij} \left( \| u_i \|^2 - 2 u_i^\top u_j + \| u_j \|^2 \right) \\ &= \sum_{i} k_i \| u_i \|^2 - \sum_{i,j} A_{ij} u_i^\top u_j\\ &= \sum_{i,j} L_{ij} u_i^\top u_j \end{aligned}
where L_{ij} = k_i if i=j and L_{ij} = -A_{ij} otherwise.
In matrix form: J_{LE}(\mathbf{U}) = \text{Tr}(\mathbf{U}^\top \mathbf{L} \mathbf{U})
See Appendix for full details.
The solution minimizes distances between connected nodes: as the derivation above shows, the objective can be written compactly as J_{LE}(\mathbf{U}) = \text{Tr}(\mathbf{U}^\top \mathbf{L} \mathbf{U}), where \mathbf{L} is the graph Laplacian matrix with entries L_{ii} = k_i and L_{ij} = -A_{ij} for i \neq j.
Minimizing J_{LE} directly admits the trivial solution \mathbf{U} = \mathbf{0}, so we impose a normalization constraint such as \mathbf{U}^\top \mathbf{U} = \mathbf{I}. Setting the derivative of the Lagrangian \text{Tr}(\mathbf{U}^\top \mathbf{L} \mathbf{U}) - \lambda \left( \text{Tr}(\mathbf{U}^\top \mathbf{U}) - d \right) to zero gives:
\mathbf{L} \mathbf{U} = \lambda \mathbf{U}
Solution: The d eigenvectors associated with the d smallest eigenvalues of the Laplacian matrix \mathbf{L}.
The smallest eigenvalue is always zero (with eigenvector of all ones). In practice, compute d+1 smallest eigenvectors and discard the trivial one.
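The recipe above translates almost line by line into code. A minimal sketch with NumPy (the toy network is an assumption of this example):

```python
import numpy as np

# Toy network: two triangles joined by a single edge.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
N, d = 6, 2
A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1

L = np.diag(A.sum(axis=1)) - A       # graph Laplacian L = D - A

# Eigenvectors sorted by ascending eigenvalue; drop the trivial
# constant eigenvector (eigenvalue 0) and keep the next d.
eigvals, eigvecs = np.linalg.eigh(L)
U = eigvecs[:, 1:d + 1]
print(U)
```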
4 Neural Embedding
Introduction to word2vec
Neural embedding methods leverage neural network architectures to learn node representations. Before discussing graph-specific methods, we first introduce word2vec, which forms the foundation for many neural graph embedding techniques.
word2vec is a neural network model that learns word embeddings in a continuous vector space. It was introduced by Tomas Mikolov and his colleagues at Google in 2013 (Mikolov et al. 2013).
How word2vec Works
word2vec learns word meanings from context, following the linguistic principle: “You shall know a word by the company it keeps” (Church 1988).
This phrase means a person is similar to those they spend time with. It comes from Aesop’s fable The Ass and his Purchaser (500s B.C.): a man brings an ass to his farm on trial. The ass immediately seeks out the laziest, greediest ass in the herd. The man returns the ass, knowing it will be lazy and greedy based on the company it chose.
The Core Idea: Given a target word, predict its surrounding context words within a fixed window. For example, in:
The quick brown fox jumps over a lazy dog
The context of fox (with window size 2) includes: quick, brown, jumps, over.
The window size determines how far we look around the target word. A window size of 2 means we consider 2 words before and 2 words after.
Why This Works: Words appearing in similar contexts get similar embeddings. Consider:
- “The quick brown fox jumps over the fence”
- “The quick brown dog runs over the fence”
- “The student studies in the library”
Both fox and dog appear with words like “quick,” “brown,” and “jumps/runs,” so they’ll have similar embeddings. But student appears in completely different contexts (like “studies,” “library”), so its embedding will be far from fox and dog. This is how word2vec captures semantic similarity without explicit supervision.
Two Training Approaches:
- Skip-gram: Given a target word → predict context words (we’ll use this)
- CBOW: Given context words → predict target word
Skip-gram works better for small datasets and rare words. CBOW is faster to train.
Network Architecture: word2vec uses a simple 3-layer neural network:
- Input layer (N neurons, one per vocabulary word): One-hot encoding of the target word
- Hidden layer (d neurons, d \ll N): The learned word embedding
- Output layer (N neurons): Probability distribution over context words (via softmax)
The hidden layer activations become the word embeddings—dense, low-dimensional vectors that capture semantic relationships.
For a visual walkthrough of word2vec, see The Illustrated Word2vec by Jay Alammar.
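To make this concrete, here is a minimal skip-gram sketch using the gensim library (assumed to be installed); the three-sentence corpus is a toy example, so the similarity scores are only indicative:

```python
from gensim.models import Word2Vec

sentences = [
    "the quick brown fox jumps over the fence".split(),
    "the quick brown dog runs over the fence".split(),
    "the student studies in the library".split(),
]

# sg=1 selects skip-gram; vector_size is the embedding dimension d,
# window is the context window size discussed above.
model = Word2Vec(sentences, vector_size=16, window=2, sg=1,
                 min_count=1, seed=42, epochs=300)

# "fox" and "dog" share contexts, so they should be more similar to each
# other than either is to "student" (noisy on such a tiny corpus).
print(model.wv.similarity("fox", "dog"))
print(model.wv.similarity("fox", "student"))
```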
Key Technical Components
The Computational Challenge
To predict context word w_c given target word w_t, we compute:
P(w_c | w_t) = \frac{\exp(\mathbf{v}_{w_c} \cdot \mathbf{v}_{w_t})}{\sum_{w \in V} \exp(\mathbf{v}_w \cdot \mathbf{v}_{w_t})}
This is the softmax function—it converts dot products into probabilities that sum to 1.
The problem: The denominator sums over all vocabulary words (e.g., 100,000+ terms), making training prohibitively slow.
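The following toy computation (NumPy only, with randomly generated embeddings standing in for trained ones) shows why: every single probability requires a dot product with every word in the vocabulary.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 100_000, 64                  # vocabulary size and embedding dimension
embeddings = rng.normal(size=(V, d))

target = embeddings[42]             # v_{w_t}
scores = embeddings @ target        # one dot product per vocabulary word: O(|V| d)
probs = np.exp(scores - scores.max())   # subtract max for numerical stability
probs /= probs.sum()                # softmax over the full vocabulary

print(probs.shape, probs.sum())     # (100000,) 1.0
```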
Two Solutions
Hierarchical Softmax: Organizes vocabulary as a binary tree with words at leaves. Computing probability becomes traversing root-to-leaf paths, reducing complexity from O(|V|) to O(\log |V|).

Negative Sampling: Instead of normalizing over all words, sample a few “negative” (non-context) words. Contrast the target’s true context with these random words. This approximates the full softmax much more efficiently.
Typically, 5-20 negative samples are enough. The model learns to distinguish true context words from random words.
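A sketch of the negative-sampling objective for a single (target, context) pair, with randomly initialized embeddings standing in for a real model (NumPy only):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
V, d = 10_000, 64
in_vecs = rng.normal(scale=0.1, size=(V, d))    # target-word embeddings
out_vecs = rng.normal(scale=0.1, size=(V, d))   # context-word embeddings

target, context = 7, 42
negatives = rng.integers(0, V, size=5)          # 5 random "negative" words

# Loss: -log sigma(v_c . v_t) - sum_n log sigma(-v_n . v_t)
pos_score = out_vecs[context] @ in_vecs[target]
neg_scores = out_vecs[negatives] @ in_vecs[target]
loss = -np.log(sigmoid(pos_score)) - np.sum(np.log(sigmoid(-neg_scores)))
print(loss)
```

Only the positive pair and the five sampled negatives enter the loss, instead of all |V| vocabulary words.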
What’s Special About word2vec?
With word2vec, words are represented as dense vectors, enabling us to explore their relationships using simple linear algebra. This is in contrast to traditional natural language processing (NLP) methods, such as bag-of-words and topic modeling, which represent words as discrete units or high-dimensional vectors.

word2vec embeddings can capture semantic relationships, such as analogies (e.g., man is to woman as king is to queen) and can visualize relationships between concepts (e.g., countries and their capitals form parallel vectors in the embedding space).
5 Graph Embedding with word2vec
How can we apply word2vec to graph data? There is a critical challenge: word2vec takes sequences of words as input, while graph data are discrete and unordered. A solution to fill this gap is random walk, which transforms graph data into a sequence of nodes. Once we have a sequence of nodes, we can treat it as a sequence of words and apply word2vec.
Random walks create “sentences” from graphs: each walk is a sequence of nodes, just like a sentence is a sequence of words.
DeepWalk
DeepWalk is one of the pioneering works to apply word2vec to graph data (Perozzi, Al-Rfou, and Skiena 2014). It views the nodes as words and the random walks on the graph as sentences, and applies word2vec to learn the node embeddings.
More specifically, the method contains the following steps:
- Sample multiple random walks from the graph.
- Treat the random walks as sentences and feed them to word2vec to learn the node embeddings.
DeepWalk typically uses the skip-gram model with hierarchical softmax for efficient training.
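A minimal DeepWalk-style sketch, assuming the networkx and gensim libraries are available (the walk length, number of walks per node, and the karate-club example graph are arbitrary choices for illustration):

```python
import random
import networkx as nx
from gensim.models import Word2Vec

G = nx.karate_club_graph()
random.seed(42)

def random_walk(G, start, length=10):
    walk = [start]
    for _ in range(length - 1):
        walk.append(random.choice(list(G.neighbors(walk[-1]))))
    return [str(v) for v in walk]          # word2vec expects "words" (strings)

# Step 1: sample multiple random walks per node ("sentences").
walks = [random_walk(G, v) for v in G.nodes() for _ in range(10)]

# Step 2: feed the walks to word2vec (skip-gram with hierarchical softmax).
model = Word2Vec(walks, vector_size=32, window=5, sg=1, hs=1, negative=0,
                 min_count=1, seed=42)
print(model.wv[str(0)][:5])                # embedding of node 0
```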
node2vec
node2vec (Grover and Leskovec 2016) extends DeepWalk with biased random walks controlled by two parameters:
P(v_{t+1} = x | v_t = v, v_{t-1} = t) \propto \begin{cases} \frac{1}{p} & \text{if } d(t,x) = 0 \text{ (return to previous)} \\ 1 & \text{if } d(t,x) = 1 \text{ (close neighbor)} \\ \frac{1}{q} & \text{if } d(t,x) = 2 \text{ (explore further)} \\ \end{cases}
where d(t,x) is the shortest path distance from the previous node t to candidate node x.
Controlling Exploration:
- Low p → return bias (local revisiting)
- Low q → outward bias (exploration)
- High q → inward bias (stay local)
Two Exploration Strategies:
- BFS-like (high q): Stay within the immediate neighborhood → captures structural roles (structural equivalence)
- DFS-like (low q): Explore paths leading away from the start → captures community structure (homophily)
Technical Note: node2vec uses negative sampling instead of hierarchical softmax, which affects embedding characteristics (Kojaku et al. 2021; Dyer 2014).
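The transition rule above can be sketched directly; the helper below (an illustration, not the reference implementation) returns the unnormalized weights for one step of the biased walk, assuming networkx is available:

```python
import networkx as nx

def biased_weights(G, t, v, p=1.0, q=1.0):
    """Unnormalized P(v_{t+1} = x | v_t = v, v_{t-1} = t) for each neighbor x of v."""
    weights = {}
    for x in G.neighbors(v):
        if x == t:                    # d(t, x) = 0: return to the previous node
            weights[x] = 1.0 / p
        elif G.has_edge(t, x):        # d(t, x) = 1: stay close to t
            weights[x] = 1.0
        else:                         # d(t, x) = 2: move away from t
            weights[x] = 1.0 / q
    return weights

G = nx.karate_club_graph()
print(biased_weights(G, t=0, v=1, p=0.5, q=2.0))
```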
LINE
LINE (Tang et al. 2015) is another pioneering method that learns node embeddings by directly optimizing an objective defined over the edges of the graph. It is equivalent to node2vec with p=1, q=1, and window size 1.
Neural methods are less transparent, but recent work establishes equivalences to spectral methods under specific conditions (Qiu et al. 2018; Kojaku et al. 2024). Surprisingly, DeepWalk, node2vec, and LINE are also provably optimal for the stochastic block model (Kojaku et al. 2024).
6 Hyperbolic Embeddings: Beyond Euclidean Space
All the embedding methods we’ve discussed so far operate in Euclidean space (standard flat geometry). However, many real-world networks exhibit hierarchical structures and scale-free properties that are difficult to capture in Euclidean space. Hyperbolic embeddings offer a powerful alternative by embedding networks in hyperbolic geometry—a curved space with negative curvature.
The Popularity-Similarity Framework
The foundation for hyperbolic network embeddings comes from a seminal discovery: real networks grow by optimizing trade-offs between popularity and similarity (Papadopoulos et al. 2012). When analyzing the evolution of the Internet, metabolic networks, and social networks, Papadopoulos et al. found that new connections are not formed simply by preferential attachment to popular nodes. Instead, nodes balance two factors:
- Popularity: Connecting to well-established, highly connected nodes
- Similarity: Connecting to nodes with similar characteristics or interests
This popularity × similarity optimization has a remarkable geometric interpretation: it naturally emerges as optimization in hyperbolic space. In this framework:
- Node popularity maps to radial position (birth time, with older = more central)
- Node similarity maps to angular distance on a circle
- New connections form to nodes that are hyperbolically closest
In the popularity-similarity model, the hyperbolic distance x_{ij} \approx r_i + r_j + 2\ln(\theta_{ij}/2) combines popularity (radial coordinates r) and similarity (angular distance \theta). New nodes connect to the m hyperbolically closest nodes.
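To make the growth rule concrete, here is a heavily simplified sketch (standard-library Python only): radial positions grow as r_t = 2\ln t, angular positions are uniform, and popularity fading is omitted, so this illustrates only the connection rule rather than the full model of Papadopoulos et al. (2012).

```python
import math
import random

random.seed(0)
N, m = 50, 2
nodes, edges = [], []            # nodes: list of (r, theta)

def hyperbolic_distance(a, b):
    (r1, th1), (r2, th2) = a, b
    dtheta = math.pi - abs(math.pi - abs(th1 - th2))      # angular separation in [0, pi]
    return r1 + r2 + 2 * math.log(max(dtheta, 1e-12) / 2)

for t in range(1, N + 1):
    new = (2 * math.log(t), random.uniform(0, 2 * math.pi))   # (popularity, similarity)
    # Connect the newcomer to the m hyperbolically closest existing nodes.
    nearest = sorted(range(len(nodes)),
                     key=lambda i: hyperbolic_distance(new, nodes[i]))[:m]
    edges += [(len(nodes), i) for i in nearest]
    nodes.append(new)

print(len(nodes), "nodes,", len(edges), "edges")
```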
This framework elegantly explains why preferential attachment emerges as a phenomenon: it’s not a primitive mechanism, but rather a consequence of local geometric optimization. The probability that a node of degree k attracts a new link follows \Pi(k) \propto k, exactly as in preferential attachment—but now with crucial differences in which specific nodes get connected based on their similarity.
The popularity-similarity framework directly validates against real network evolution: the probability of new connections as a function of hyperbolic distance matches theoretical predictions for Internet, metabolic, and social networks, while pure preferential attachment is off by orders of magnitude (Papadopoulos et al. 2012).
Why Hyperbolic Space?
The study of hyperbolic geometry in complex networks emerged from foundational work showing that networks with heterogeneous degree distributions and strong clustering have an effective hyperbolic geometry underneath (Krioukov et al. 2010). Hyperbolic space has a unique property: its volume grows exponentially with radius, similar to how tree-like hierarchies grow. This naturally captures key network properties:
- Scale-free degree distributions: Power-law degrees emerge from negative curvature
- Strong clustering: Nodes form tight communities based on metric proximity
- Small-world property: Short paths exist despite high clustering
- Self-similarity and navigability: Arise from underlying hyperbolic geometry (Serrano, Krioukov, and Boguñá 2008; Boguñá, Krioukov, and Claffy 2009)
The connection between hyperbolic geometry and network structure was established through a series of papers by Krioukov, Boguñá, and colleagues (2008-2010), showing that the statistical mechanics of complex networks maps directly to geometric properties of hyperbolic space.
Mathematical Models of Hyperbolic Space
Hyperbolic space can be represented through different mathematical models, each with different computational properties:
Poincaré Ball Model: Points lie within the unit ball \mathcal{B}^n = \{\mathbf{x} \in \mathbb{R}^n : \|\mathbf{x}\| < 1\}. The distance between two points \mathbf{u}, \mathbf{v} \in \mathcal{B}^n is:
d_P(\mathbf{u}, \mathbf{v}) = \text{arcosh}\left(1 + 2\frac{\|\mathbf{u} - \mathbf{v}\|^2}{(1 - \|\mathbf{u}\|^2)(1 - \|\mathbf{v}\|^2)}\right)
The Poincaré disk (2D version of the ball) visualizes hyperbolic space within a circle. Although shapes appear to shrink near the boundary in Euclidean terms, they are actually all the same size in hyperbolic geometry—the disk has infinite hyperbolic area. Geodesics (shortest paths) appear as circular arcs perpendicular to the boundary. This model is intuitive for visualization but requires careful handling of Riemannian gradients during optimization.
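A direct transcription of the distance formula above (NumPy only; the sample points are arbitrary):

```python
import numpy as np

def poincare_distance(u, v):
    duv = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * duv / ((1 - np.dot(u, u)) * (1 - np.dot(v, v))))

u, v = np.array([0.1, 0.2]), np.array([-0.5, 0.6])
print(poincare_distance(u, v))

# Near the boundary, hyperbolic distances blow up relative to Euclidean ones:
# these two points are only 0.02 apart in Euclidean distance.
print(poincare_distance(np.array([0.95, 0.0]), np.array([0.97, 0.0])))
```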
Lorentzian (Hyperboloid) Model: Points lie on the hyperboloid \mathcal{H}^n = \{\mathbf{x} \in \mathbb{R}^{n+1} : \langle\mathbf{x}, \mathbf{x}\rangle_L = -1, x_0 > 0\} where the Lorentzian inner product is:
\langle\mathbf{x}, \mathbf{y}\rangle_L = -x_0 y_0 + x_1 y_1 + \cdots + x_n y_n
The distance between two points \mathbf{u}, \mathbf{v} \in \mathcal{H}^n is:
d_L(\mathbf{u}, \mathbf{v}) = \text{arcosh}(-\langle\mathbf{u}, \mathbf{v}\rangle_L)
The hyperboloid model embeds hyperbolic space as a sheet in Minkowski space (the geometry of special relativity). Points on the hyperboloid can be projected onto the Poincaré disk, showing the equivalence between models. The Lorentzian model enables more efficient optimization because gradients can be computed in the ambient Euclidean space \mathbb{R}^{n+1}, avoiding complex Riemannian operations.
These models are isometric (distance-preserving transformations exist between them), but differ in computational efficiency. The Poincaré model (Nickel and Kiela 2017) is intuitive but requires complex Riemannian optimization. The Lorentzian model (Nickel and Kiela 2018) trades geometric purity for efficiency by computing gradients in ambient Euclidean space, making hyperbolic embeddings practical for large networks.
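The equivalence can be checked numerically with a short sketch (NumPy only): compute the Lorentzian distance on the hyperboloid, project both points onto the Poincaré ball via the standard map \mathbf{x} \mapsto (x_1, \ldots, x_n)/(1 + x_0), and compare with the Poincaré distance.

```python
import numpy as np

def lorentz_inner(x, y):
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def lorentz_distance(x, y):
    return np.arccosh(-lorentz_inner(x, y))

def lift(z):
    """Lift a point z in R^n onto the hyperboloid (x_0 = sqrt(1 + ||z||^2))."""
    return np.concatenate(([np.sqrt(1 + np.dot(z, z))], z))

def to_poincare(x):
    """Project a hyperboloid point onto the Poincare ball."""
    return x[1:] / (1 + x[0])

def poincare_distance(u, v):
    duv = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * duv / ((1 - np.dot(u, u)) * (1 - np.dot(v, v))))

x, y = lift(np.array([0.3, -1.2])), lift(np.array([2.0, 0.5]))
print(lorentz_distance(x, y))                             # distance on the hyperboloid
print(poincare_distance(to_poincare(x), to_poincare(y)))  # same value in the Poincare ball
```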
