Appendix#
Compressed Sparse Row (CSR) format#
CSR format is implemented in scipy. This consists of three arrays called indptr
, indices
, and data
. For example,
import networkx as nx
from scipy import sparse
G = nx.karate_club_graph()
A = sparse.csr_matrix(nx.adjacency_matrix(G))
print("A.indices:", A.indices[:5])
print("A.indptr:", A.indptr[:5])
print("A.data:", A.data[:5])
A.indices: [1 2 3 4 5]
A.indptr: [ 0 16 25 35 41]
A.data: [4 5 3 3 3]
We will walk you through what these arrays mean, how they are generated, and how we can leverage them for efficient computations.
How to generate CSR format from an adjacency matrix#
Let’s walk you through how to store an example adjacency matrix in Compressed Sparse Row (CSR) format. Our example adjacency matrix is as follows.
0 |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
|
---|---|---|---|---|---|---|---|---|---|---|---|
0 |
1 |
||||||||||
1 |
1 |
1 |
|||||||||
2 |
1 |
1 |
1 |
||||||||
3 |
1 |
1 |
1 |
1 |
|||||||
4 |
1 |
1 |
|||||||||
5 |
1 |
||||||||||
6 |
1 |
1 |
1 |
||||||||
7 |
1 |
||||||||||
8 |
1 |
1 |
|||||||||
9 |
1 |
1 |
|||||||||
10 |
1 |
1 |
1 |
1 |
1 |
We will first create adjacency list, which is a dictionary consisting of the row IDs and column IDs for the non-zero entries in the adjacency matrix.
\(\{\text{Row ID}: (\text{Column ID}, \text{Value})\}\)
Concretely, in Python,
adj_list = {
0:[(10,1)],
1:[(2,1), (10, 1)],
2:[(1,1), (3,1), (10, 1)],
3:[(2,1), (4,1), (5,1), (6,1)],
#...
}
CSR format is a concatenation of the keys and values of the adjacency list, respectively. The CSR format has a concatenated array of the values, one for column IDs and one for the values, called indices
and data
, respectively.
import numpy as np
indices = np.array([vv[0] for k, v in adj_list.items() for vv in v])
indices
array([10, 2, 10, 1, 3, 10, 2, 4, 5, 6])
data = np.array([vv[1] for k, v in adj_list.items() for vv in v])
data
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
Additionally, the CSR format has another array called indptr
, which stores the Row IDs of the non-zero entries in the adjacency matrix. This indptr
array has a value such that indptr[i]
is the first index of indices
that corresponds to the i
-th row of the adjacency matrix. This can be generated by
indptr = np.cumsum([0] + [len(adj_list[i]) for i in range(len(adj_list))])
indptr
array([ 0, 1, 3, 6, 10])
where we added 0
at the beginning of the array to represent the first non-zero entry in the first row.
The first row ends at index len(adj_list[0])-1
, and the second row starts at index len(adj_list[0])
and ends at index len(adj_list[0])+len(adj_list[1])-1
, and so on.
Now we have three compressed vectors indptr
, indices
, and data
, that together form the CSR format for the adjacency matrix.
How to use CSR format for efficient computations#
The key advantage of the CSR representation is the memory efficiency. But you can leverage the CSR format for more efficient computations, if you know the semantics of indptr
, indices
, and data
arrays.
For instance, one can compute the degree of a node by using
node = 1
degree = indptr[node+1] - indptr[node]
degree
2
Let us break down the above code.
indptr[node]
is the first index of theindices
array that corresponds to thenode
-th row of the adjacency matrix.indptr[node+1]
is the first index of theindices
array that corresponds to the(node+1)
-th row of the adjacency matrix.Thus,
indptr[node+1] - indptr[node]
is the number of non-zero entries in thenode
-th row of the adjacency matrix, which is the degree of thenode
-th node.
Using indices
, it is easy to identify the neighbors of a given node by using
neighbors = indices[indptr[node]:indptr[node+1]]
neighbors
array([ 2, 10])
where indices[indptr[node]:indptr[node+1]]
is the corresponding column IDs of the non-zero entries in the node
-th row of the adjacency matrix, which corresponds to the node IDs connected to the node
-th node.
The edge weights to the neighbors can be obtained by using
edge_weights = data[indptr[node]:indptr[node+1]]
edge_weights
array([1, 1])