Dealing with sparsity
Most datasets and models that we deal with in machine learning are either sparse by nature or constrained to be so.
NumPy provides data structures and algebraic routines for dense (fully populated) matrices only;
support for sparse data structures and routines is provided by SciPy, in the scipy.sparse package.
Sparse matrices can be represented internally in several different formats, and choosing the right format is essential for writing efficient code.
A full discussion is beyond the scope of this section, but here are some general guidelines.
- CSC (compressed sparse column) and CSR (compressed sparse row) are the most compact representations. As their names suggest, CSC stores the matrix as an array of columns and is more efficient at column operations (column slicing, column sums) than other representations. Conversely, CSR, being an array of rows, is more efficient at row operations.
- COO (coordinate) and DOK (dictionary of keys) representations are usually faster to initialize from tabular (row, column, value) data than CSC or CSR, but are less efficient at certain operations, so matrices are typically converted to CSC or CSR once construction is complete.
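The guidelines above can be sketched in a short example: build a matrix from triplet data in COO format, then convert to CSR or CSC depending on whether rows or columns will be sliced. The particular values here are illustrative only.

```python
import scipy.sparse as sp

# Triplet (row, col, value) data -- easy to collect incrementally
rows = [0, 1, 1, 2]
cols = [2, 0, 2, 1]
vals = [1.0, 2.0, 3.0, 4.0]

# COO is convenient for construction from tabular/triplet data
A_coo = sp.coo_matrix((vals, (rows, cols)), shape=(3, 3))

# Convert to CSR for fast row operations (e.g. row slicing, mat-vec)
A_csr = A_coo.tocsr()
row1 = A_csr[1, :].toarray()   # efficient row slice in CSR

# Convert to CSC for fast column operations
A_csc = A_coo.tocsc()
col2 = A_csc[:, 2].toarray()   # efficient column slice in CSC
```

Both conversions are cheap relative to repeatedly slicing the wrong format, which is why the COO-then-convert pattern is the usual construction idiom.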
Here are some common operations on sparse matrices in Python using SciPy. Many NumPy routines do not work directly on sparse matrices, so pay attention to which package is being invoked.
# load the scipy sparse package and numpy package
import scipy.sparse as sp
import numpy as np
# create an empty m-by-n sparse matrix in COO format
# (assuming m and n hold the desired dimensions)
A = sp.coo_matrix((m, n))
# create a sparse CSR matrix from nested python lists
A = sp.csr_matrix([[1, 0, 2], [0, 3, 3], [2, 3, 0]])
# compute the l2, l1, or max-norm of a sparse vector "a"
# (np.linalg.norm does not accept sparse inputs; use sparse-aware
#  elementwise operations instead)
l2norm = np.sqrt(a.multiply(a).sum())
l1norm = abs(a).sum()
maxnorm = abs(a).max()
# scale, shrink, or flip a vector
scaleda = 2*a
shrunka = 0.5*a
flippeda = -1.0*a
# linear combination of two vectors a and b
lincomb = 2*a + 3*b
# dot product of two vectors a and b
# (sparse vectors are stored as 1-by-n matrices, so elementwise
#  multiply-and-sum avoids a shape mismatch)
dotprod = a.multiply(b).sum()
# further routines live in the scipy.sparse.linalg submodule
import scipy.sparse.linalg as spl
# solve Ax=b without densifying; spsolve requires CSC or CSR format
x = spl.spsolve(A.tocsr(), b)
# get the upper triangular part of a sparse matrix "A"
uA = sp.triu(A)
# get the lower triangular part of a sparse matrix "A"
lA = sp.tril(A)
# matrix-vector multiplication
Ax = A.dot(x)
# find the inverse of a matrix; requires conversion to CSC or CSR first
# (note: the inverse of a sparse matrix is usually dense, so prefer
#  spsolve where possible)
Ainv = spl.inv(A.tocsc())
# find the determinant of a matrix; scipy.sparse provides no
# determinant routine, so densification is needed here
Adet = np.linalg.det(A.toarray())
# find the k largest-magnitude eigenpairs of a sparse matrix;
# d is an array of eigenvalues and Q holds the eigenvectors as columns
d, Q = spl.eigs(A.asfptype(), k=1)
# find a truncated SVD (k largest singular values) of a sparse matrix
U, s, Vt = spl.svds(A.asfptype(), k=1)
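As a worked example tying these pieces together, the sketch below builds a small tridiagonal system in CSC format, solves it with spsolve, and checks the answer against the dense solver. The matrix and right-hand side are illustrative choices, not from the text above.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spl

# A small sparse system Ax = b (tridiagonal matrix as an illustration)
A = sp.diags([[-1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0]],
             offsets=[-1, 0, 1], format="csc")
b = np.array([1.0, 0.0, 1.0])

# spsolve works directly on CSC/CSR -- no densification needed
x = spl.spsolve(A, b)

# verify against the dense solver on the densified matrix
x_dense = np.linalg.solve(A.toarray(), b)
assert np.allclose(x, x_dense)
```

For this particular system the solution is the all-ones vector, which makes the check easy to confirm by hand.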