Efficient representation for large sparse networks

Efficient representation for large sparse networks#

An adjacency matrix is a convenient way to represent a network. A challenge of handling large networks is that the adjacency matrix can be too large to fit in memrory. For example, a network with \(10^5\) nodes requires a \(10^5 \times 10^5\) matrix, totaling \(10\) billion entries! A good news is that we do not need to hold all these entries in memory, if we know the network is sparse.

Many networks in real-world are sparse, meaning most nodes connect to only a few others. The result is that the adjacency matrix often contains many zeros. This is where we can save significant memory by storing only the non-zero entries.

Compressed Sparse Row (CSR) is an efficient way to store sparse networks by treating the adjacency matrix like a scatter plot. Instead of storing all entries, CSR only keeps track of the “coordinates” (row and column indices) of non-zero entries, along with their values.

Optional Exercise

For those who are interested in the details of CSR format, please do the following:

  • 📝 Pen and paper exercise here

  • 💻 (Advanced) Coding exercise in the Appendix.

https://miro.medium.com/v2/resize:fit:1100/format:webp/0*zWhtdW4nYTSO3nya.gif

Fig. 9 Compressed Sparse Row (CSR) matrix. Source: Medium: Sparse GEMM and Tensor Core’s Structured Sparsity#

The CSR format is implemented in the scipy library. It is straightforward to convert the CSR matrix from the dense adjacency matrix.

from scipy.sparse import csr_matrix

A = [[0, 2, 0, 1],
     [2, 0, 2, 1],
     [0, 2, 0, 1],
     [1, 1, 1, 0]]

A_csr = csr_matrix(A)
A_csr
<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 10 stored elements in Compressed Sparse Row format>

If you have an edge list, you can directly generate the CSR matrix without creating the dense matrix first.

from scipy.sparse import csr_matrix

edges = [(0,1), (0, 1), (0, 3), (1, 2), (1, 2), (1, 3), (2, 3)]

src = [edge[0] for edge in edges]
trg = [edge[1] for edge in edges]
values = [1 for _ in edges]
A_csr = csr_matrix((values, (src, trg)), shape=(4, 4))
A_csr
<4x4 sparse matrix of type '<class 'numpy.int64'>'
	with 5 stored elements in Compressed Sparse Row format>

where src, trg, and values are lists of the source nodes, target nodes, and edge weights, respectively.