Appendix#

Compressed Sparse Row (CSR) format#

CSR format is implemented in scipy. This consists of three arrays called indptr, indices, and data. For example,

import networkx as nx
from scipy import sparse

G = nx.karate_club_graph()
A = sparse.csr_matrix(nx.adjacency_matrix(G))

print("A.indices:", A.indices[:5])
print("A.indptr:", A.indptr[:5])
print("A.data:", A.data[:5])
A.indices: [1 2 3 4 5]
A.indptr: [ 0 16 25 35 41]
A.data: [4 5 3 3 3]

We will walk you through what these arrays mean, how they are generated, and how we can leverage them for efficient computations.

How to generate CSR format from an adjacency matrix#

Let’s walk you through how to store an example adjacency matrix in Compressed Sparse Row (CSR) format. Our example adjacency matrix is as follows.

0

1

2

3

4

5

6

7

8

9

10

0

1

1

1

1

2

1

1

1

3

1

1

1

1

4

1

1

5

1

6

1

1

1

7

1

8

1

1

9

1

1

10

1

1

1

1

1

We will first create adjacency list, which is a dictionary consisting of the row IDs and column IDs for the non-zero entries in the adjacency matrix.

\(\{\text{Row ID}: (\text{Column ID}, \text{Value})\}\)

Concretely, in Python,

adj_list = {
  0:[(10,1)],
  1:[(2,1), (10, 1)],
  2:[(1,1), (3,1), (10, 1)],
  3:[(2,1), (4,1), (5,1), (6,1)],
  #...
}

CSR format is a concatenation of the keys and values of the adjacency list, respectively. The CSR format has a concatenated array of the values, one for column IDs and one for the values, called indices and data, respectively.

import numpy as np

indices = np.array([vv[0] for k, v in adj_list.items() for vv in v])
indices
array([10,  2, 10,  1,  3, 10,  2,  4,  5,  6])
data = np.array([vv[1] for k, v in adj_list.items() for vv in v])
data
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

Additionally, the CSR format has another array called indptr, which stores the Row IDs of the non-zero entries in the adjacency matrix. This indptr array has a value such that indptr[i] is the first index of indices that corresponds to the i-th row of the adjacency matrix. This can be generated by

indptr = np.cumsum([0] + [len(adj_list[i]) for i in range(len(adj_list))])
indptr
array([ 0,  1,  3,  6, 10])

where we added 0 at the beginning of the array to represent the first non-zero entry in the first row. The first row ends at index len(adj_list[0])-1, and the second row starts at index len(adj_list[0]) and ends at index len(adj_list[0])+len(adj_list[1])-1, and so on.

Now we have three compressed vectors indptr, indices, and data, that together form the CSR format for the adjacency matrix.

How to use CSR format for efficient computations#

The key advantage of the CSR representation is the memory efficiency. But you can leverage the CSR format for more efficient computations, if you know the semantics of indptr, indices, and data arrays.

For instance, one can compute the degree of a node by using

node = 1
degree = indptr[node+1] - indptr[node]
degree
2

Let us break down the above code.

  • indptr[node] is the first index of the indices array that corresponds to the node-th row of the adjacency matrix.

  • indptr[node+1] is the first index of the indices array that corresponds to the (node+1)-th row of the adjacency matrix.

  • Thus, indptr[node+1] - indptr[node] is the number of non-zero entries in the node-th row of the adjacency matrix, which is the degree of the node-th node.

Using indices, it is easy to identify the neighbors of a given node by using

neighbors = indices[indptr[node]:indptr[node+1]]
neighbors
array([ 2, 10])

where indices[indptr[node]:indptr[node+1]] is the corresponding column IDs of the non-zero entries in the node-th row of the adjacency matrix, which corresponds to the node IDs connected to the node-th node.

The edge weights to the neighbors can be obtained by using

edge_weights = data[indptr[node]:indptr[node+1]]
edge_weights
array([1, 1])