Linear Algebra

Foundations of Machine Learning

Machine learning (ML) would be too simple, ineffective, and quite frankly dull, if we could develop only univariate models on univariate datasets. Linear algebra provides the data-types, the tools, the operations, and the theory to enable the use of multivariate datasets and multivariate models. This article presents an overview of concepts from linear algebra that are essential to achieving mastery in ML.

Linear algebra is all about data types, their properties, and operations that can be performed on them. Data types are containers for data. They are named according to their capacity.

  • A scalar is a single value.
  • A collection of scalars is a vector.
  • A matrix is a collection of vectors.

We start our discussion with the most primitive of data types, the scalar.

Scalar

A scalar is the most primitive data type. It is a container for a single value. It could be a constant or a univariate variable.

Although there is no universally accepted notation, it is common practice to denote scalar constants with Greek letters and scalar variables with letters of the English alphabet. For example, \( \alpha, \beta, \gamma, \) and \( \lambda \) are all scalar constants, while \( a, b, c, \ldots, x, y, z \) are all scalar variables.

Scalar variables can be real or complex. For machine learning, limiting our discussion to real-valued scalars will suffice. So, henceforth, \( a \in \real \).

Scalars support all basic arithmetic operations that you are already familiar with: addition, subtraction, multiplication, and division. Extended arithmetic operations such as roots, exponentiation, logarithms, and trigonometric functions are also supported.

In machine learning, most models are multivariate. But the vital tunable variables, the so-called hyper-parameters, are usually scalars. The success, downfall, and reproducibility of machine learning algorithms hinge on these hyper-parameters. So, scalars may be simple, but they are nothing to be scoffed at.

Vector

A vector is a collection of scalars. To indicate that a vector is a bigger container than a scalar, the vector notation uses bold-faced characters. For example

$$ \begin{equation} \va = \begin{bmatrix} a_1 \\ \vdots \\ a_n \end{bmatrix} \end{equation} $$

Here \( \va \) is a vector containing \( n \) values. This is succinctly represented as \( \va \in \real^n \).

A vertical vector, as represented in the equation above, is known as a column vector. The horizontal one is known as a row vector.

$$ \begin{equation} \vb = \begin{bmatrix} a_1 \ldots a_n \end{bmatrix} \end{equation} $$

A row vector can be transformed into a column vector and vice-versa with the transpose operator. It is denoted by super-script \( T \) on the vector being transposed. For example

$$ \vb = \va^T $$

$$ \va = \vb^T $$

Unless otherwise noted, in most machine learning literature, it is common practice to present equations and analyses in the column-vector form. In our series of articles, vectors will always be column-vectors.

Vector: Geometric representation

(Interactive demo: drag the circle to change the vector \( \va \) and read off its coordinates.)

Vector magnitude and direction

A vector has magnitude and direction. The magnitude of a vector is measured in terms of its norm. The \( p \)-norm, for \( p \ge 1 \), provides a generalized way of computing a whole family of vector norms.

$$ ||\va||_p = \left( \sum_{i=1}^n |a_i|^p \right)^\frac{1}{p} $$

Three norms, derived from \(p\)-norm, are popular in machine learning.

  • \(L_1\) or Manhattan or Taxi-cab norm

$$ ||\va||_1 = \sum_{i=1}^n |a_i| $$

  • \(L_2\) or Euclidean norm

$$ ||\va||_2 = \left( \sum_{i=1}^n a_i^2 \right)^\frac{1}{2} $$

  • \(L_\infty\) or max norm

$$ ||\va||_\infty = \max_{1 \le i \le n} |a_i| $$

A vector with unit Euclidean norm, \( || \va ||_2 = 1 \), is known as a unit vector.

Any vector can be normalized to a unit vector by dividing it with its norm, for example, the Euclidean norm.

$$ \vec{\va} = \frac{\va}{||\va||_2} $$

\(\vec{\va} \) has the same direction as that of the original vector, but has unit-magnitude.
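These norms map directly onto `numpy.linalg.norm`. Here is a minimal sketch, with an illustrative vector only:

```python
import numpy as np

a = np.array([3.0, -4.0])

l1   = np.linalg.norm(a, ord=1)       # |3| + |-4| = 7
l2   = np.linalg.norm(a)              # sqrt(3^2 + 4^2) = 5 (Euclidean is the default)
linf = np.linalg.norm(a, ord=np.inf)  # max(|3|, |-4|) = 4

a_hat = a / l2                        # unit vector in the direction of a
print(l1, l2, linf, np.linalg.norm(a_hat))  # 7.0 5.0 4.0 1.0
```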

Next, we present an interactive demonstration of vector norms. Stretch and rotate the vector. Note the change in the various norms to build an intuition about vector magnitudes and direction. Also, use the slider to change the norm type to get a shape of the unit ball, the region that is traversed by the unit vector of a given norm type. Understanding the shapes of these regions is crucial in choosing appropriate norm types in machine learning.

Vector: Norms

(Interactive demo: drag the circle on the arrowhead to change the vector and watch \( ||\va||_1 \), \( ||\va||_2 \), and \( ||\va||_\infty \) update.)

Stretching or shrinking a vector

The magnitude of the vector can be stretched or shrunk. This is achieved by multiplication with a scalar.

$$ \vb = \alpha \va $$

The result, \( \vb \), is a vector with the same number of entries as \( \va \). It is easy to prove that the magnitude of \( \vb \) and that of \( \va \) are related.

$$ || \vb ||_2 = |\alpha| || \va ||_2 $$

Note that scalar multiplication can also flip a vector's direction (a quick numerical check follows this list).

  • If \( 0 < \alpha < 1 \), then \( || \vb ||_2 < || \va ||_2 \). No change in direction.
  • If \( \alpha > 1 \), then \( || \vb ||_2 > || \va ||_2 \). No change in direction.
  • If \( -1 < \alpha < 0 \), then \( || \vb ||_2 < || \va ||_2 \). Also, \( \vec{\vb} \) points in the opposite direction of \( \vec{\va} \)
  • If \( \alpha < -1 \), then \( || \vb ||_2 > || \va ||_2 \). Also, \( \vec{\vb} \) points in the opposite direction of \( \vec{\va} \)
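Before the demo, here is a quick numerical check of the relationship \( ||\alpha \va||_2 = |\alpha| \, ||\va||_2 \) and of the direction flip, a minimal sketch with arbitrary values:

```python
import numpy as np

a = np.array([1.0, 2.0])
alpha = -2.5

b = alpha * a
# Magnitudes are related by |alpha|.
print(np.isclose(np.linalg.norm(b), abs(alpha) * np.linalg.norm(a)))  # True

# Directions: b points opposite to a because alpha < 0.
print(a / np.linalg.norm(a))   # [0.447 0.894]
print(b / np.linalg.norm(b))   # [-0.447 -0.894]
```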

To build intuition about scalar multiplication, try out our interactive demo presented next. Vary the value of the scalar multiplier \( c \) by adjusting the slider. Note the difference in the vector \( \va \) while you do that.

Vector: Stretch, Shrink, Flip

(Interactive demo: drag the circle to change the vector; adjust the slider to change the scalar multiplier \( c \).)

Rotating a vector

If a vector can be stretched or shrunk, so can it be rotated.

Here's a thought experiment.

Suppose an object \( \va \) is moving in a certain direction. Another object \( \vb \) strikes it, moving in another direction, not necessarily head-on.

What happens? The object changes its original trajectory.

The new trajectory and the new speed are a result of two conditions prior to the strike.

  • The speed of the object \( \va \) relative to that of the object \( \vb \).
  • The direction of the object \( \va \) with respect to that of the object \( \vb \).

So is the case with deflecting or rotating a vector. A vector \( \va \) can be rotated by introducing the effects of another vector \( \vb \).

$$ \vc = \alpha \va + \beta \vb $$

This operation of scaling each vector and summing them up is known as a linear combination. A linear combination of \( m \) vectors, \( \va_i \in \real^n, \forall i \in \{1,\ldots,m\} \) is defined as

$$ \vc = \sum_{i=1}^m \alpha_i \va_i $$

So, the result of a linear combination of a set of vectors is a new vector with the same number of entries, but different direction and magnitude.

If the \( \alpha_i \) are constrained to be positive and sum to one, \( \sum_i \alpha_i = 1, \forall \alpha_i \ge 0 \), then, the linear combination is more specifically known as a convex combination.
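A linear combination is just a scaled sum of vectors, and a convex combination additionally constrains the weights. A minimal sketch with made-up vectors and weights:

```python
import numpy as np

a1 = np.array([1.0, 0.0])
a2 = np.array([0.0, 1.0])
a3 = np.array([1.0, 1.0])

# Linear combination: any real weights.
alphas = [2.0, -1.0, 0.5]
c = sum(alpha * a for alpha, a in zip(alphas, [a1, a2, a3]))
print(c)  # [ 2.5 -0.5]

# Convex combination: non-negative weights that sum to one.
weights = np.array([0.2, 0.3, 0.5])
assert np.all(weights >= 0) and np.isclose(weights.sum(), 1.0)
c_convex = weights[0] * a1 + weights[1] * a2 + weights[2] * a3
print(c_convex)  # [0.7 0.8]
```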

Build intuition about the linear combination operation with our interactive demo, presented next. Use the sliders to modify the scalars in the operation. Drag the pulsating input vectors to change their orientation. Note the impact on the result of the linear combination operation. Experiment with the convex combination by constraining the scalars appropriately.

Vector: linear combination

(Interactive demo: drag the circles to change the input vectors; use the sliders to change the scalars in the combination.)

Span, basis, and null-space

Now that you understand linear-combination, it is time to understand the span of a set of vectors.

The set of all vectors obtainable by a linear combination of a given set of vectors is known as the span of those vectors.

$$ \text{span}(\{\va_1,\ldots,\va_m\}) = \left\{ \vc \in \real^n : \vc = \sum_{i=1}^m \alpha_i \va_i, \forall \alpha_i \in \real \right\} $$

We saw in an earlier interactive demo that multiplication by a scalar stretches, shrinks, or flips a vector. So, it is easy to see that a single vector cannot span a 2-dimensional space such as this screen; its span is only the line it lies on.

From the linear-combination demo, you must have noticed that by just varying the scalars it is possible to move the output vector a whole 360 degrees. Try it if you didn't already.

Thus, you need at least 2 vectors to span the whole of a 2-dimensional space. Such 2 vectors are known as the basis set of that space.

From the same interactive demo, you will note that any two input vectors can serve as a basis for the 2-dimensional space, unless the two input vectors are merely scaled versions of each other. Try it in the linear combination demo; this is a crucial point.

What if we introduced a third input vector in a 2-dimensional space? Well, no harm, but no benefit either; we are not covering any new ground. So, a basis of a space is defined as a minimal set of vectors that spans that space.

In the interactive demos that we have been using so far, the span of the two axes is the entire 2-dimensional plane of the screen. So, the two axes are known as the basis of this span or space.

Dot product

The dot product of two vectors is the sum of the element-wise product of the two vectors. So,

$$ \va \cdot \vb = \sum_{i=1}^n a_i b_i $$

Angle between two vectors

If \( \theta \) is the angle between two vectors \( \va \) and \( \vb \) in the Euclidean space, then \( \va \cdot \vb = ||\va||_2 ||\vb||_2 \cos\theta \).

Interact with the next demo to understand how the angle between two vectors relates to their dot products. The angle value varies between 0 and 180 degrees. The normalized product of two vectors is

  • positive when the angle between them is less than 90 degrees.
  • negative when the angle between them is more than 90 degrees.
  • zero when the angle between them is exactly 90 degrees.

Note that the unit vectors along the 2 axes we have been using for the interactive demos are \( \vx = [1,0] \) and \( \vy = [0,1] \) for the horizontal and vertical axes, respectively. Their dot product is zero, \( \vx \cdot \vy = 0 \), and they are perpendicular to each other.

Another important point: \( \cos 0 = 1 \), so the dot product of a vector with itself is the square of its magnitude, \( \vx \cdot \vx = ||\vx||_2^2 \). You will see this alternative way of arriving at vector magnitude in a number of places in machine learning literature.
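These relationships are easy to verify numerically. A minimal sketch, with arbitrary example vectors:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])

dot = np.dot(a, b)
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
theta = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
print(dot, theta)                          # 1.0 45.0

# Perpendicular axes have a zero dot product ...
x_axis, y_axis = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(np.dot(x_axis, y_axis))              # 0.0

# ... and a vector dotted with itself gives its squared Euclidean norm.
print(np.isclose(np.dot(b, b), np.linalg.norm(b) ** 2))  # True
```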

Next, we present an interactive module to demonstrate how changing the angle between two vectors updates their dot product. Among other actions, understand the effect of superimposing the two vectors, putting them at a right angle, or pointing them in opposite directions.

Vector: Dot product

(Interactive demo: drag the circles to change the vectors and observe the normalized product \( \frac{\va \cdot \vb}{||\va||_2 ||\vb||_2} \) and the angle between \( \va \) and \( \vb \).)

Orthogonalization

We saw in the earlier demo that the dot product of two vectors is zero if the angle between them is 90 degrees, since \( \cos 90^\circ = 0 \). Such non-zero vectors with a zero dot product are said to be mutually orthogonal.

Also, if two orthogonal vectors are also unit vectors, then they are said to be orthonormal.

If a set of vectors is the basis for a space and they are also mutually orthonormal, then they are known as the orthonormal basis of that space. Note that the X-axis and the Y-axis are the orthonormal basis of a Cartesian coordinate system and any point on that space can be defined in terms of their X and Y coordinates.

An understanding of orthogonal or orthonormal vectors is important for ML applications. Data or models seldom automatically lead to orthogonal vectors. In machine learning, it is a common practice to discover the orthonormal basis for any given space. So, one uses the Gram-Schmidt approach to arrive at orthogonal vectors from any given set of vectors. This process is known as orthogonalization.

Here's how the Gram-Schmidt approach for orthogonalization transforms a set of vectors \( \{\va_1, \ldots, \va_n \} \) into a set of mutually orthogonal vectors \( \{\vx_1, \ldots, \vx_n \} \) (a code sketch follows the steps).

  1. Let \( \vx_1 = \va_1 \).
  2. Compute the projection of the next vector \( \va_2 \) onto \( \vx_1 \). Intuitively, this is the portion of \( \va_2 \) in the direction of \( \vx_1 \). $$ \text{proj}_{\vx_1}(\va_2) = \frac{\va_2 \cdot \vx_1}{\vx_1 \cdot \vx_1} \vx_1. $$
  3. Now we just need to figure out the portion of \( \va_2 \) that is not explained by the projection \( \text{proj}_{\vx_1}(\va_2) \). Finding the remainder means subtraction, and that's what we do. $$ \vx_2 = \va_2 - \text{proj}_{\vx_1}(\va_2) $$
  4. Continue the process for the next vector \( \va_i \), but this time identify the portion that is left after removing components for all previous orthogonal vectors. $$ \vx_i = \va_i - \sum_{j = 1}^{i-1} \text{proj}_{\vx_j}(\va_i) $$
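The steps above translate almost directly into code. Here is a minimal sketch (the helper name `gram_schmidt` is just illustrative, and it does not guard against linearly dependent inputs, which would produce zero vectors):

```python
import numpy as np

def gram_schmidt(vectors):
    """Return mutually orthogonal vectors spanning the same space as `vectors`."""
    orthogonal = []
    for a in vectors:
        x = a.astype(float)
        for q in orthogonal:
            # Subtract the projection of `a` onto each previously found vector.
            x -= (np.dot(a, q) / np.dot(q, q)) * q
        orthogonal.append(x)
    return orthogonal

a1, a2 = np.array([3.0, 1.0]), np.array([2.0, 2.0])
x1, x2 = gram_schmidt([a1, a2])
print(x1, x2, np.dot(x1, x2))  # the dot product is (numerically) zero
```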

Interact with the orthogonalization demo next to get a good intuitive feel for the process. Check out what happens when you change either of the input vectors. Note the change in the projection and the resulting orthogonalization result.

Vector: orthogonalization

(Interactive demo: drag the circles to change the input vectors \( \va \) and \( \vb \). The demo shows the projection \( \text{proj}_{\va}(\vb) = \frac{\va \cdot \vb}{\va \cdot \va} \va \) and the component of \( \vb \) orthogonal to \( \va \), that is, \( \vb - \text{proj}_{\va}(\vb) \).)

Linear equations

Let's digress a little bit. Linear Algebra was developed to solve systems of linear equations. So it is crucial we understand some basics about solving linear equations here.

Suppose you wish to solve an equation in 2 unknown variables.

$$ 2x + 4y = 22 $$

Can you solve this to find the value of \( x \) and \( y \)? No, not unless you have more information.

Suppose the additional information is provided in the form of another equation

$$ x + y = 6 $$

Great! Now you can solve it. How did you do it?

Mechanically, you went through these steps.

  1. Multiply the second equation with 2 to get \( 2x + 2y = 12 \).
  2. Subtract this updated equation from the first equation to get \( 2y = 10 \).
  3. Solve this simple equation in one variable \( y = 5 \).
  4. Substitute back in the second equation to get \( x = 1 \).

Note how you first eliminated \( x \) to arrive at an equation in one variable \( y \). This mechanical process is known as the process of Gaussian elimination.

Gaussian elimination is a general purpose strategy to solve equations in any number of unknown variables.
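In code, you would rarely carry out these row operations by hand. A minimal sketch using NumPy's solver on the example system above (`np.linalg.solve` uses an LU factorization, i.e. Gaussian elimination with pivoting, under the hood):

```python
import numpy as np

# Coefficients of: 2x + 4y = 22  and  x + y = 6
A = np.array([[2.0, 4.0],
              [1.0, 1.0]])
b = np.array([22.0, 6.0])

solution = np.linalg.solve(A, b)
print(solution)  # [1. 5.]  ->  x = 1, y = 5
```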

Check out the next interactive demo to see Gaussian elimination in action. Vary the number of equations and the number of unknowns to see the steps needed to arrive at a solution. Try to understand when the system is unable to solve the equations. Building intuition here is key to understanding much of linear algebra.

(Note: We are deliberately not doing any exception handling or dealing with infinity and NaNs. Check out when you get them to strengthen your intuition.)

Gaussian elimination

(Interactive demo: vary the number of equations and unknowns and step through the elimination.)

Can every set of equations be solved?

If you have interacted with the demo above, you already know the answer. And also some pitfalls.

No, not every set of equations can be solved. Here are the obvious problem cases:

  • If the number of unknowns is more than the number of equations
  • If the number of equations is more than the number of unknowns

In the first case, we have inadequate information to attempt to solve the problem. Many solutions will satisfy the given equations. But we will not be able to narrow down to the precise one.

In the second case, some equations might be inconsistent with the others. That means, the solution for a subset of equations will differ from another subset of equations. Hence, again we cannot precisely arrive at a single solution to the problem.

But what if the number of equations is the same as the number of unknowns? Let's take an example. Here are some equations in 2 variables.

$$ \begin{aligned} & 2x + 3y = 5 \\ & 4x + 6y = 10 \end{aligned} $$

If we were to do Gaussian elimination here, then to eliminate \( x \) from the second equation, we will subtract 2 times the first equation from the second to get \( 0x + 0y = 0 \) !

Not good! In our attempt to eliminate \( x \), we got rid of \( y \) too. We cannot solve for \( y \) anymore.

This happened because the second equation was a mere scaled up version of the first equation. In other words, the second equation, although seemingly different, conveyed the same information as the first one. It was not an independent piece of information.

What if we had more than 2, say 3, equations in 3 unknown variables? This is where things get interesting. Check out the next Gaussian elimination demo where we have purposely constrained the equations to have a certain nature. The last equation for all examples in this demo is a sum over scalar multiples of previous equations.

Here's what you will find. To be able to solve equations in \( n \) unknown variables, you need exactly \( n \) independent and consistent equations.

Gaussian elimination: Solvability

(Interactive demo: Gaussian elimination on systems where the last equation is a sum of scalar multiples of the previous equations.)

Representing equations as vector dot product

Now that we understand how to solve linear equations, how about we come up with a compact representation for equations?

An astute reader will note that a linear equation such as \( 2x_1 + 4x_2 = 22 \) can be written in the form of vector dot products that we presented earlier.

$$ \va \cdot \vx = y $$

Here, \( \vx \in \real^2 \) is a vector containing the variables, \( \vx = [x_1,x_2] \). Also, \( \va \in \real^2 \) is another vector containing the constants, \( \va = [2,4] \). And \( y = 22 \) is the result.

This is a succinct representation and will be followed henceforth to represent equations.

What about independence of equations in terms of vectors? Continuing our discussion of independence, if one equation is a scaled version of another, then they are not independent. In vector notation, if \( \va_j = \alpha \va_i \) for some scalar \( \alpha \) and \( i \ne j \), then the vectors are not independent. Conversely, if \( \va_j \ne \alpha \va_i \) for every scalar \( \alpha \), then the vectors \( \va_i \) and \( \va_j \) are said to be linearly independent.

What if there are more than 2 equations and more than 2 variables? Let's suppose we have \( m \) equations and \( n \) variables. Easy, we create \( m \) vectors, each with \( n \) values. Each of the \( m \) equations produces one output, giving \( m \) outputs in total.

Collectively, this can all be written as $$ \va_i \cdot \vx = y_i, \forall i = \{1,\ldots,m\} $$

What about independence for more than 2 equations? Just like we need 2 linearly independent vectors to solve for 2 unknowns, so do we need \( n \) linearly independent vectors to solve for \( n \) unknowns. Let's extend our formula of independence to more than 2 vectors.

So, if any of the \( m \) vectors can be written as a linear combination of the remaining vectors, \( \va_j = \sum_{i \ne j} \alpha_i \va_i \) for some scalars \( \alpha_i \), then the set of vectors is not linearly independent.

In plain terms, a group of \( m \) vectors is said to be linearly independent if none of them can be written as the linear combination of the rest.

And, to solve for \( n \) unknowns, you need \( n \) linearly independent equations or, equivalently, vectors.

A more concise representation for equations

Now that we have a concise representation for a single equation as a vector, how about we come up with a composite representation over multiple vectors. Let's stack each vector in the set \( \{\va_1, \ldots, \va_m \}, \va_i \in \real^n, \forall i \) as follows and refer to the entire box by the symbol \( \mA \).

$$ \mA = \begin{bmatrix} \va_{1,1} & \ldots & \va_{1,n} \\ \vdots & \ddots & \vdots \\ \va_{m,1} & \ldots & \va_{m,n} \\ \end{bmatrix} $$

Here, \( \va_{i,j} \) represents the \( j \)-th element of the \( i \)-th vector.

This composite representation of multiple vectors is known as a matrix.

Each horizontal list of elements is known as a row of the matrix. For example, the elements \( [\va_{1,1}, \va_{1,2}, \ldots, \va_{1,n}] \) form the first row of the matrix \( \mA \).

Similarly, each vertical list of elements is known as a column of the matrix. For example, the elements \( [\va_{1,1}, \va_{2,1}, \ldots, \va_{m,1}] \) form the first column of the matrix \( \mA \).

Naturally, the list of elements from the top left to the bottom right is known as the diagonal of the matrix. For example, the elements \( [\va_{1,1}, \va_{2,2}, \ldots, \va_{n,n}] \) form the diagonal of the matrix \( \mA \).

So, now that the vectors of multipliers have been stacked, how about the vector of unknowns and outputs?

Easy: we define a new operation on the matrix, the so-called matrix multiplication, to make it as succinct as

$$ \mA \vx = \vy $$

How cool is that? Matrix multiplication works by taking a row from the matrix \( \mA \), say \( \va_i \) , and performing a dot product of \( \va_i \) with the column vector \( \vx \) of unknowns to arrive at the answer \( y_i \). Doing so over all the rows of the matrix leads to \( m \) outputs, thereby creating our output vector \( \vy \).

It should be noted that the number of results in \( \vy \) is the same as the number of rows of \( \mA \).
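A minimal sketch showing that the matrix-vector product is exactly a stack of row-wise dot products (the matrix and vector are illustrative values):

```python
import numpy as np

A = np.array([[2.0, 4.0],
              [1.0, 1.0],
              [3.0, -1.0]])   # m = 3 equations, n = 2 unknowns
x = np.array([1.0, 5.0])

y = A @ x                                     # matrix-vector multiplication
y_rowwise = np.array([np.dot(row, x) for row in A])

print(y)                                      # [22.  6. -2.]
print(np.allclose(y, y_rowwise))              # True; y has as many entries as A has rows
```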

Gaussian elimination can be easily re-implemented for matrices as simple row-operations. Multiply a row with an appropriate scalar and subtract it from another row to eliminate a variable and continue to the solution as we demonstrated earlier.

What about the independence requirements we figured out earlier? Well, they still apply. More succinctly so. To solve equations in \( n \) variables you need \( n \) linearly independent rows and \( n \) linearly independent columns in the matrix \( \mA \).

Now that we have introduced the matrix, there is no looking back. All further discussion will be about matrices (plural of matrix), their properties, their types, and the operations defined on them.

Matrix rank

Of course, the discussion about independence still applies. Since each row of the matrix represents a vector from the original equations, the system of equations in \( n \) unknowns represented by a matrix is solvable only if there are \( n \) linearly independent rows in \( \mA \).

The number of linearly independent rows of a matrix is known as the row rank of the matrix.

Conversely, since each column is also a vector, the number of linearly independent columns of a matrix is known as the column rank of the matrix.

A fundamental result of linear algebra states that the row rank and the column rank of any matrix are always equal. This common number of independent rows or columns is simply referred to as the rank of the matrix. The proof is out of scope here, but it is worth checking out.

Some interesting facts about matrix rank.

  • If a matrix has any non-zero element, then its rank is at least \( 1 \).
  • A matrix with all zero entries has rank \( 0 \).
  • The rank of a matrix is at most the lesser of its number of rows and columns.

$$ \text{rank}(\mA) \le \min(m,n), \quad \mA \in \real^{m \times n} $$

So, getting back to solving linear equations, to solve \( \mA \vx = \vb \), where \( \vx \in \real^n \), \( \mA \) needs to have \( n \) linearly independent rows. In other words, \( \text{rank}(\mA) = n \).

If every row of a matrix is linearly independent of the rest, then it is called a full row rank matrix. Similarly, if every column of a matrix is linearly independent of the rest, then it is referred to as a full column rank matrix.

The matrix is said to be full rank if it has the maximum achievable rank for a matrix of its size, limited only by the number of rows or columns.

A matrix that is not full rank is considered to be rank deficient.

We know from our previous discussion on Gaussian elimination that we need a full-rank matrix to arrive at a solution.
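NumPy can compute the rank directly. A minimal sketch, reusing the dependent pair of equations from earlier (\( 2x + 3y = 5 \) and \( 4x + 6y = 10 \)):

```python
import numpy as np

A_dependent = np.array([[2.0, 3.0],
                        [4.0, 6.0]])    # second row is 2x the first
A_independent = np.array([[2.0, 4.0],
                          [1.0, 1.0]])

print(np.linalg.matrix_rank(A_dependent))    # 1 -> rank deficient, no unique solution
print(np.linalg.matrix_rank(A_independent))  # 2 -> full rank
```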

Matrix types

Some matrix operations and theory apply to special types of matrices, so it is good to be aware of them.

A square matrix is one with an equal number of rows and columns. Matrices with unequal rows and columns are known as rectangular, by the virtue of their shape.

A symmetric matrix is one with the exact same entries on either side of the main diagonal. That means \( \mA \) is symmetric if and only if \( \va_{i,j} = \va_{j,i}, \forall i, j \in \{1,\ldots,n \} \). Needless to say, a matrix is symmetric only if it is also square.

A Hermitian matrix is the complex analogue of real symmetric matrix. It is a complex square matrix that is equal to its own conjugate transpose.

A diagonal matrix is a square matrix with zeros everywhere except (possibly) along the main diagonal. It is usually denoted as \( \text{diag}(d_1,\ldots,d_n) \).

An identity matrix is a diagonal matrix that has \( 1 \)'s along the main diagonal and \( 0 \)s everywhere else. It is usually denoted as \( \meye \), such that

$$ \meye_{i,j} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{if } i \neq j \end{cases} $$

An upper triangular matrix is one in which all elements below the main diagonal are zero; non-zero elements can appear only on or above the main diagonal.

Conversely, a lower triangular matrix is one in which all elements above the main diagonal are zero; non-zero elements can appear only on or below the main diagonal.

A square matrix whose rows (and, equivalently, columns) are mutually orthonormal is known as an orthogonal matrix. A unitary matrix is the complex analogue of a real orthogonal matrix.

We saw that multiplying a matrix with a vector leads to a linear equation. There are also classes of matrices based on their definiteness, a general concept in mathematics for quadratic forms. So, a square symmetric matrix \( \mA \in \real^{n \times n} \) is called (a numerical check follows this list)

  • positive definite if \( \vx^T \mA \vx > 0 \) for all non-zero \( \vx \in \real^n \).
  • positive semi-definite if \( \vx^T \mA \vx \ge 0, \forall \vx \in \real^n \).
  • negative definite if \( \vx^T \mA \vx < 0 \) for all non-zero \( \vx \in \real^n \).
  • negative semi-definite if \( \vx^T \mA \vx \le 0, \forall \vx \in \real^n \).
  • indefinite if \( (\vx^T \mA \vx)(\vy^T \mA \vy) < 0 \) for some \( \vx \in \real^n \) and \( \vy \in \real^n \).
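In practice, you rarely check \( \vx^T \mA \vx \) for every possible \( \vx \); for a symmetric matrix, an equivalent test is to inspect the signs of its eigenvalues (eigenvalues are introduced later in this article). A minimal sketch, with illustrative matrices and a hypothetical helper name:

```python
import numpy as np

def definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)   # eigenvalues of a symmetric matrix are real
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semi-definite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semi-definite"
    return "indefinite"

print(definiteness(np.array([[2.0, -1.0], [-1.0, 2.0]])))  # positive definite
print(definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```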

In this section, we have merely defined the various matrix types. To really build intuition about what these actually mean, we first need to understand the effect of multiplying a matrix with a vector. We present this next.

Matrix as a transformation on a vector

One way to look at our system of linear equations \( \mA \vx = \vy \) is to think that the matrix \( \mA \) is acting as a transformation on the vector \( \vx \) and resulting in another vector \( \vy \).

This is akin to vector scaling and rotation that we presented earlier. Let's build some intuition about matrix as a transformation on a vector. It will be crucial for understanding some complex matrix transformations and operations that we present later.

Let us start with the identity matrix. We are interested in the product \( \meye \vx = \vy \). Expressing it in a verbose form, for let's say 2 dimensions, helps to understand what really happens.

$$ \begin{aligned} & \meye \vx = \vy \\ & \implies \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\ & \implies \begin{bmatrix} 1x_1 + 0x_2 \\ 0x_1 + 1x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\ & \implies \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \\ & \implies x_1 = y_1 \text{ and } x_2 = y_2 \\ & \implies \vx = \vy
\end{aligned} $$

This essentially means that multiplying by an identity matrix preserves the original vector.

What about multiplication by a scaled identity matrix, say \( \alpha \meye \)? It is easy to extend the analysis above to note that \( \vy = \alpha \vx \).

What about multiplying a general vector \( \vx \in \real^n \) by a general diagonal matrix, say \( \text{diag}(\alpha_1, \ldots, \alpha_n) \)? Again, it is easy to work out the math to note that each element of the resulting vector will be scaled by the corresponding factor in the diagonal matrix, so that \( \vy = [y_1, \ldots, y_n] \) with \( y_i = \alpha_i x_i, \forall i \in \{1, \ldots, n \} \).

So, the moral of the story is that multiplication with a diagonal matrix scales each element of the input vector. This may seem like a trivial transform, but it is a handy tool in implementing computationally efficient code. You would be surprised by the number of for-loops that can be avoided by simple diagonal matrix multiplication.
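As a concrete illustration of the for-loop remark, here is a minimal sketch comparing an explicit loop against a diagonal-matrix (or, better still, element-wise) multiplication, with illustrative values:

```python
import numpy as np

alphas = np.array([2.0, 0.5, -1.0])
x = np.array([1.0, 4.0, 3.0])

# Loop version: scale each element individually.
y_loop = np.array([alphas[i] * x[i] for i in range(len(x))])

# Diagonal-matrix version: a single matrix-vector product.
y_diag = np.diag(alphas) @ x

# Vectorized element-wise product (what you would actually write).
y_vec = alphas * x

print(y_loop, y_diag, y_vec)  # all equal: [ 2.  2. -3.]
```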

Check out the next interactive demo to understand this transformation effect of a diagonal matrix on an input vector. In this demo, we have constrained the matrix to be a diagonal matrix. Check out what happens when all diagonal elements of the matrix are the same or distinct. Check if you can scale, flip, and make the input vector span the entire space by just multiplying with a diagonal matrix.

Diagonal matrix transformations

(Interactive demo: drag the circles to change the diagonal input matrix \( \mA \) and the input vector \( \vx \), and observe the output vector \( \mA \vx \).)

Matrix as a transformation on space

Instead of checking the impact of a matrix on a single vector, wouldn't it be cool to check out the overall impact of a matrix on an entire collection of vectors, or rather, on the entire space?

This is what we do in the upcoming demos. We have represented the space with dots lined along concentric circles. Each dot represents the end point of a vector in that space.

The size of the dot is proportional to its distance from the center. Thus, the size of the dot is indicative of the magnitude of the vector it represents. (You know why!)

The color of the dot changes based on the angle the vector makes with the \(X\)-axis. So, all dots with the same color represent vectors that are at the same angle with the \(X\)-axis.

We start with demonstrating the effect of a diagonal matrix. You can modify a particular diagonal element of the matrix by dragging the corresponding axis vector. You will note the following:

  • Changing the diagonal elements stretches or shrinks the space along the corresponding axis. There is no effect on the spacing of the dots along the other axis.
  • The relative ordering of points remains the same, implying a linear transform.
  • There is absolutely no rotation.
  • Negating the values along a diagonal has the effect of flipping the axis and space.

Diagonal matrix transformations on space

(Interactive demo: drag the circles to change the diagonal input matrix \( \mA \); panels show the input \( \vx \) space and the output \( \mA \vx \) space.)

Orthogonal matrix transformation on space

Now that we know that diagonal matrices stretch and shrink the space, let's try to understand the impact of orthogonal matrices on the space.

Many of the important matrix factorizations and decompositions deal with orthogonal matrices. This demo will help you understand the implications of orthogonal transformations.

In the next demo, we will constrain the matrix to be an orthogonal matrix. That means the row vectors will be constrained to be orthonormal: each is constrained to be a unit vector, and the angle between them is constrained to be 90 degrees.

In this demo, you can drag only one row vector; the other one adjusts itself to remain orthonormal to the first.

Check out how an orthogonal matrix only rotates the input space. And it rotates the input space by the same amount as the matrix is rotated. Also note that there is absolutely no stretching or shrinking due to an orthogonal matrix.

Orthogonal matrix transformations on space

(Interactive demo: drag the circle to change the associated row vector of the orthogonal matrix \( \mA \); panels show the input \( \vx \) space and the output \( \mA \vx \) space.)

Upper triangular matrix transform on space

Upper triangular matrices are zero below the main diagonal.

In the next demo, spin the row representing the \(X\)-axis (row 1 in the demo) a full 360 degrees. Observe how the \(Y\)-axis of the space rotates due to multiplication by an upper triangular matrix, while the \(X\)-axis does not rotate at all. This is because the second row has a zero in its first position, so the second output coordinate never depends on the input's \(X\)-coordinate; points on the \(X\)-axis stay on the \(X\)-axis, apart from any stretching, shrinking, or flipping along the \(Y\)-axis if you modify that row.

Upper triangular matrix transformations on space

(Interactive demo: drag the circle to change the associated row vector; panels show the input \( \vx \) space and the output \( \mA \vx \) space.)

Lower triangular matrix transform on space

Lower triangular matrices are zero above the main diagonal.

I am sure you know what to expect in this case.

In the next demo, spin the row representing the \(Y\)-axis (row 2 in the demo) a full 360 degrees. Observe how the \(X\)-axis of the space rotates due to multiplication by a lower triangular matrix.

As expected, there is no rotation of the \(Y\)-axis. The reason is the same as before: the first row has a zero in its second position, so the first output coordinate never depends on the input's \(Y\)-coordinate, and points on the \(Y\)-axis stay on the \(Y\)-axis, apart from any stretching, shrinking, or flipping along the \(X\)-axis if you modify that row.

Lower triangular matrix transformations on space

(Interactive demo: drag the circle to change the associated row vector; panels show the input \( \vx \) space and the output \( \mA \vx \) space.)

Symmetric matrix transform on space

Symmetric matrices are a very important family of matrices that appear very often in machine learning literature.

We know what influence diagonal and orthogonal matrices have on an input space. A symmetric matrix could sometimes be diagonal or orthogonal, so we already know what to expect in those situations.

In the next demo, we have constrained the first element of the second row vector to move in sync with the second element of first row vector. This ensures a symmetric matrix at all times.

Keep a watch on the dark blue dots lining the \(X\)-axis as you move the first row of the matrix. These points rotate along with that vector.

Also notice the light green dots; these always align with the second row vector.

Also note that when the two row vectors are aligned along the same line, the space is squashed into a single dimension: all points fall onto a line in the same direction as the first row.

Every time the two row vectors align the space flips over, as the vectors cross each other.

Now stabilize the first row and see what happens when you move the second row along its constrained plane. You will notice that there is no more rotation. Instead, there is stretching, shrinking, and inverting across the plane set up by the first row.

It is almost like the first row vector is setting the direction (rotation) of the space and then the second dependent vector is acting as a diagonal matrix along that orientation.

Try out the demo to see if you can visualize these ideas.

Symmetric matrix transformations on space

(Interactive demo: drag the circle to change the associated row vector of the symmetric matrix \( \mA \); panels show the input \( \vx \) space and the output \( \mA \vx \) space.)

Positive definite matrix transform on space

A positive definite matrix is any symmetric matrix \( \mA \in \real^{n \times n} \) such that \( \vx^T \mA \vx > 0 \) for all non-zero \( \vx \in \real^n \).

How might such a matrix impact the underlying space?

Try out the next demo to find out. We have constrained the matrix to be positive definite. To ensure that, we have used a trick.

Here's how you can generate a positive definite matrix. Being a sum of squares, \( \vy^T \vy > 0 \) for every non-zero \( \vy \in \real^n \). If \( \vy = \mA \vx \), this means \( (\mA \vx)^T (\mA \vx) > 0 \) whenever \( \mA \vx \ne \mathbf{0} \). Re-arranging, we get \( \vx^T \mA^T \mA \vx > 0 \). So \( \mA^T \mA \) is positive definite whenever the columns of \( \mA \) are linearly independent (which guarantees \( \mA \vx \ne \mathbf{0} \) for non-zero \( \vx \)), even if \( \mA \) itself is not.

Such matrices \( \mA^T \mA \) and \( \mA \mA^T \) appear quite often in machine learning literature. It is good to know that a matrix multiplied with its transpose is always at least positive semi-definite, and positive definite when the matrix has full rank.
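A quick numerical check of this construction, a minimal sketch with an arbitrary random matrix (if \( \mA \) were rank deficient, the product would only be positive semi-definite):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # arbitrary rectangular matrix

gram = A.T @ A                    # symmetric by construction
print(np.allclose(gram, gram.T))  # True
print(np.linalg.eigvalsh(gram))   # all non-negative; strictly positive when A has full column rank
```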

Getting back to the next interactive demo: you will be modifying the matrix \( \mA \), and all the transforms will be shown in the form of \( \mA^T \mA \).

Being a symmetric matrix, much of what you saw in the symmetric matrix transform will hold here.

With one big difference: the positive definite matrix never turns the space by more than 90 degrees. Pay attention to the dark green points (or pick your favorite color) as you move. See how constrained they are in the transformed space. The angle between each vector \( \vx \) in the original space and the output of the transform, \( \mA^T \mA \vx \), is always less than 90 degrees.

Thus, each point is constrained to an imaginary orthant around it. This is almost equivalent to multiplying by a positive number. That is one intuition behind positive definite.

Some observations to note though. Positive definite matrices may have negative elements. The diagonal elements of a positive definite matrix are always positive.

Can you think about what will happen if the matrix were negative definite?

Positive definite matrix transformations on space

(Interactive demo: drag the circle to change the associated row vector of \( \mA \); panels show \( \mA \), \( \mA^T \mA \), the input \( \vx \) space, and the output \( \mA^T \mA \vx \) space.)

Matrix inverse

Now that we understand the matrix multiplication operation procedurally and visually, it is time to understand its counterpart: the matrix inversion operation.

Matrix inversion reverses the effect of a matrix multiplication. The matrix inverse of a matrix \( \mA \) is denoted by \( \mA^{-1} \).

By definition, it should recover the vector that was multiplied by the matrix. So, if \( \mA \vx = \vy \), then it should be the case that \( \mA^{-1} \vy = \mA^{-1}(\mA \vx) = \vx \).

Thus, \( \vx = \mA^{-1}\vy \) is a solution to our linear system of equations in the variables \( \vx \). We know from our earlier discussion that such a solution exists only if \( \mA \) is full rank, that is, \( \text{rank}(\mA) = n \) when \( \vx \in \real^n \).

The smallest matrix \( \mA \in \real^{m \times n} \) that can have rank \( n \) is a square matrix with \( m = n \), that is, \( \mA \in \real^{n \times n} \).

Thus, given the requirements for matrix multiplication, the inverse is also a square matrix of the same size, \( \mA^{-1} \in \real^{n \times n} \).

We require \( \mA \) to be a full-rank matrix for \( \mA \vx = \vy \) to have a solution. So, for \( \mA \) to be invertible, or for the inverse to exist, it needs to be square and full rank. No inverse, no solution. No solution, no inverse. Easy as that.

Thus, a matrix inverse is only defined for a square matrix.

Calculating an inverse

We saw earlier how to solve equations in \( n \) unknowns by Gaussian elimination. Well, finding the inverse of an \( n \times n \) matrix requires solving for \( n^2 \) unknowns. So, how do we do it?

Well, here's the key information we know: \( \mA \mA^{-1} = \meye \). This is a system of \( n^2 \) equations in \( n^2 \) unknowns. So, as long as \( \mA \) is full rank, we should be able to solve it. Also, we can use plain old Gaussian elimination to do it: the same old row multiplication, subtraction, and variable elimination.
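In code, the inverse is one call away, although for solving \( \mA\vx = \vy \) it is usually better (faster and more stable) to call the solver directly. A minimal sketch reusing the earlier example system:

```python
import numpy as np

A = np.array([[2.0, 4.0],
              [1.0, 1.0]])
y = np.array([22.0, 6.0])

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, np.eye(2)))   # True: A A^{-1} = I
print(A_inv @ y)                           # [1. 5.], the same solution as before
print(np.linalg.solve(A, y))               # preferred: solves without forming the inverse
```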

What about rectangular matrices?

The Moore-Penrose pseudoinverse, denoted as \( \mA^+ \), is a generalization of the inverse for rectangular matrices. We do not deal with that often in machine learning literature, so this will not be covered here.

Building intuition about the inverse operation

To build further intuition, check out this demo involving matrix inverse. The right-most column of charts shows the space transformations \( \vx, \mA\vx, \mA^{-1}\mA \vx \), in that order. The bottom-most row of charts shows the space transformations \( \vx, \mA^{-1} \vx, \mA \mA^{-1}\vx \).

Notice how the inverse of the matrix rotates the space in a direction opposite that by the input matrix.

Also notice that when the matrix stretches the space along a direction, its inverse shrinks it along the same direction.

The effect is reversed when the matrix shrinks it.

Thus, the net effect of \( \mA^{-1} \mA \vx \) is the recovery of the original space \( \vx \), as evidenced in the last panel of the bottom row of charts.

Cool, isn't it?

Inverse matrix transformation recovery

(Interactive demo: drag the circle to change the associated row vector; panels show the input \( \vx \) space, the transformed \( \mA \vx \) and \( \mA^{-1} \vx \) spaces, and the recovered \( \mA \mA^{-1} \vx \) or \( \mA^{-1} \mA \vx \) space.)

Inverse of an orthogonal matrix

The inverse of an orthogonal matrix is particularly interesting.

Let \( \mQ \) be an orthogonal matrix.

Remember that the rows and columns of an orthogonal matrix are orthonormal vectors. So a dot product of a row-vector with itself is 1. The dot product between any pair of mutually orthonormal vectors is 0.

We know that the inverse must satisfy \( \mQ \mQ^{-1} = \meye \). In the product \( \mQ \mQ^{-1} \), we take the dot product of the rows of \( \mQ \) with the columns of \( \mQ^{-1} \).

The result \( \mQ \mQ^{-1} = \meye \) implies that the \( i \)-th row of \( \mQ \) is the same as the \( i \)-th column of \( \mQ^{-1} \). Moreover, the \( i \)-th row is orthogonal to the rest of the columns, \( j \ne i \).

This would be possible if \( \mQ^{-1} = \mQ^T \). Thus, the transpose of an orthogonal matrix is its inverse. And because a matrix always has a transpose, an orthogonal matrix is always invertible and never singular.

This is a key result. It makes it desirable to factorize or decompose a matrix such that some components are orthogonal matrices. We will study two such decompositions: Eigendecomposition and Singular value decomposition.
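A quick numerical check, using a 2-D rotation matrix as a standard example of an orthogonal matrix (a minimal NumPy sketch):

```python
import numpy as np

theta = np.deg2rad(30.0)
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rows and columns are orthonormal

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T Q = I
print(np.allclose(np.linalg.inv(Q), Q.T))         # True: the inverse is just the transpose
```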

Determinants

In the inverse matrix recovery demo, you must have noticed that the space collapses, and is irrecoverable, if the two row vectors of the matrix \( \mA \) are perfectly aligned, either in the same direction or opposite. (Try it if you haven't noticed that already)

Let's think of this from a visual perspective. Imagine a parallelogram described by the row vectors of the matrix \( \mA \). This parallelogram has area.

When one row vector of the matrix is a scaled version of another, the parallelogram has no area.

And when that is the case, the matrix is not invertible.

In more than two dimensions, this idea can be generalized to a parallelepiped with volume. If the parallelepiped defined by the row vectors of a matrix has zero volume in \( n \)-dimensional space, then the matrix is not invertible.

This area or more generally volume is called the determinant of the matrix. A matrix with zero determinant is not invertible. Matrices with zero determinants are known as singular matrices.

Imagine a \( 2 \times 2 \) matrix \( \mA \)

$$\mA = \begin{bmatrix} a & b \\ c & d \\ \end{bmatrix}$$

Its determinant, or the area of the corresponding parallelogram, is denoted by \( | \mA | \) and equals \( ad - bc \). (Verify this to understand it.)

As a special case, it is easy to verify that the volume or determinant of a diagonal matrix in any number of dimensions is a simple product of its diagonal elements. Thus, if any diagonal element is zero, the matrix has a zero volume or zero determinant. Consequently, a square diagonal matrix is not invertible if any diagonal element is zero.

It should be obvious that being a volume or area, a determinant is only defined for square matrices. (Think why).

Now, despite being an area, the determinant is not always positive; the sign indicates whether the transformation flips (reflects) the space. For example, the determinant of an orthogonal matrix is either \(+1\) or \(-1\).
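A minimal numerical check of the \( ad - bc \) formula and of a singular case, with illustrative matrices:

```python
import numpy as np

A = np.array([[2.0, 3.0],
              [1.0, 4.0]])
print(np.linalg.det(A), 2 * 4 - 3 * 1)   # both (approximately) 5.0

S = np.array([[2.0, 3.0],
              [4.0, 6.0]])               # second row is a scaled first row
print(np.linalg.det(S))                  # 0.0 -> singular, not invertible
```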

In the next demo, check out the imaginary parallelogram described by the rows of the matrix. Note how the area changes, in sign and magnitude as you modify the matrix.

Determinant as area or volume

(Interactive demo: drag the circles to change the rows of \( \mA \) and observe the corresponding parallelogram, the determinant, and whether the matrix is singular.)

Eigenvectors and eigenvalues

We understood linear transformations resulting from a matrix multiplication.

What if the vector \( \vy \) resulting from \( \mA \vx \) was just a scaled version of the input vector \( \vx \), \( \mA \vx = \vy = \lambda \vx \)?

That means, \( \mA \vx = \lambda \vx \). This can be interpreted to mean that the matrix \( \mA \) is merely stretching or shrinking the vector \( \vx \). There is no rotation.

Of course, such an interesting transformation may not happen for any random pair of a matrix and a vector. So, the relationship between \( \mA \) and \( \vx \) must be special if \( \mA \vx = \lambda \vx \).

Indeed, such a vector \( \vx \) is known as the eigenvector of the matrix \( \mA \). The stretch/shrinkage that it undergoes, \( \lambda \), is known as the eigenvalue, corresponding to that eigenvector.

Now trivially, if a vector \( \vx \) is an eigenvector for a matrix, then so is any scaled version of it, \( \gamma \vx \). So, when one talks about eigenvectors, one usually refers to unit eigenvectors, such that \( || \vx ||_2 = 1 \).

Here are some important facts to know about eigenvalues and eigenvectors

  • A real \( n \times n \) matrix can have anywhere between 0 and \( n \) real eigenvalues and corresponding eigenvectors.
  • Eigenvalues and eigenvectors are not defined for rectangular matrices.
  • Square symmetric matrices have real eigenvalues and a complete set of orthonormal eigenvectors.
  • If two or more eigenvectors share the same eigenvalue, then any vector lying in their span is also an eigenvector with that eigenvalue.
  • All eigenvalues of a positive definite matrix are positive.
  • All eigenvalues of a positive semi-definite matrix are non-negative.

Eigendecomposition

Suppose a matrix \( \mA \in \real^{n \times n} \) is a square matrix with \( n \) linearly independent eigenvectors \( \vx_1, \ldots, \vx_n \) and corresponding eigenvalues \( \lambda_1,\ldots,\lambda_n \).

Let's define a new matrix \( \mQ \) to be a collection of these eigenvectors, such that each column of \( \mQ \) is an eigenvector of \( \mA \)

$$ \mQ = \begin{bmatrix} \vx_1 ~\ldots~ \vx_n \end{bmatrix} $$

Now,

$$ \mA \mQ = \begin{bmatrix} \mA \vx_1 ~\ldots~ \mA \vx_n \end{bmatrix} $$

And because \( \vx_i, \forall i \) are eigenvectors, \( \mA \vx_i = \lambda_i \vx_i \). Substituting

$$ \mA \mQ = \begin{bmatrix} \lambda_1 \vx_1 ~ \ldots ~ \lambda_n \vx_n \end{bmatrix} $$

This means \( \mA \mQ = \mQ \mLambda \), where we have let \( \mLambda \) denote the diagonal matrix of eigenvalues.

$$ \mLambda = \begin{bmatrix} \lambda_1 & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \lambda_n \end{bmatrix} $$

Since the matrix \( \mQ \) is invertible (do you know why? its columns are linearly independent), we can multiply both sides of the equation by \( \mQ^{-1} \) on the right.

$$ \mA = \mQ \mLambda \mQ^{-1} $$

Here, \( \mLambda \) is a diagonal matrix, such that the elements along the main diagonal are the eigenvalues of \( \mA \). The matrix \( \mQ \in \real^{n \times n}\) is a square matrix such that the \( i \)-th column of \( \mQ \) is the eigenvector of \( \mA \) corresponding to the eigenvalue in the \( i \)-th row of \( \mLambda \).

Any square matrix with \( n \) linearly independent eigenvectors can be factorized this way. This factorization of the matrix into its eigenvalues and eigenvectors is known as the eigendecomposition of the matrix.

For real symmetric matrices, eigendecomposition is guaranteed. The Spectral Theorem states that the eigendecomposition of a real symmetric matrix leads to orthogonal eigenvectors \( \mA = \mQ \mLambda \mQ^T \).

When an eigendecomposition exists, it may not be unique. If there are duplicate eigenvalues, then any vector in the span of the corresponding eigenvectors is also an eigenvector, so an alternate \( \mQ \) can be chosen.

Eigendecomposition is one of the approaches to finding the inverse of a matrix that we alluded to earlier. If a matrix can be eigendecomposed, then finding its inverse is quite easy, using the fact that the inverse of a product is the product of the inverses in reverse order.

$$ \inv{\mA} = \inv{(\mQ \mLambda \inv{\mQ})} = \mQ \inv{\mLambda} \inv{\mQ} $$

When \( \mA \) is symmetric, \( \mQ \) is orthogonal, and we know from the previous discussion that the inverse of an orthogonal matrix is cheap (it is just the transpose) and always exists. In general, \( \mQ^{-1} \) still exists because the columns of \( \mQ \) are linearly independent.

Since \( \mathbf{\Lambda} \) is a diagonal matrix, its inverse, \( \inv{\mathbf{\Lambda}} \), can also be easily computed. It is merely another diagonal matrix whose entries are the reciprocals of the values along the diagonal of the original matrix.

$$ \mLambda^{-1} = \begin{bmatrix} \frac{1}{\lambda_1} & \ldots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \ldots & \frac{1}{\lambda_n} \end{bmatrix} $$

Now here's the interesting part. If any of the eigenvalues is zero, then the diagonal matrix \( \mLambda \) is not invertible (its determinant is zero!), and by consequence, \( \mA \) is also not invertible.

In fact, a square matrix is singular if and only if zero is one of its eigenvalues.

Eigendecomposition appears repeatedly in machine learning literature, sometimes as the key step of the learning algorithm itself.
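As a minimal sketch of how this looks in NumPy (the matrix is an arbitrary example with distinct, non-zero eigenvalues):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, Q = np.linalg.eig(A)        # columns of Q are the eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(A, Q @ Lam @ np.linalg.inv(Q)))   # True: A = Q Lambda Q^{-1}

# Inverse via the decomposition (valid here because no eigenvalue is zero).
A_inv = Q @ np.diag(1.0 / eigvals) @ np.linalg.inv(Q)
print(np.allclose(A_inv, np.linalg.inv(A)))         # True
```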

Check out the next interactive demo to understand eigendecomposition visually. The column on the right shows the space transformed by \( \mA \). The bottom row shows the same transformation, arrived at by sequentially multiplying with the eigendecomposition components. It is presented in the following order: \( \vx \rightarrow \mQ^{-1}\vx \rightarrow \mLambda\mQ^{-1}\vx \rightarrow \mQ \mLambda \mQ^{-1} \vx \).

Note how the first transformation \( \mQ^{-1}\vx \) mostly rotates the space into an orientation that is amenable to the appropriate stretch or shrinkage by the \( \mLambda \) in the second step. In the final step, the transformation matrix \( \mQ \), being an inverse of the first step, reverses the rotation from the first step.

Try to modify the matrix such that one of the elements of the diagonal matrix \( \mLambda \) becomes almost zero. Notice how the space gets squashed along that direction. Also, observe what happens when you end up with a zero eigenvalue in the \( \mLambda \) matrix.

Eigendecomposition transformation recovery

(Interactive demo: drag the circle to change the associated row vector; panels show the input \( \vx \) space, the output \( \mA \vx \) space, and the intermediate \( \mQ^{-1} \vx \), \( \mLambda \mQ^{-1} \vx \), and \( \mQ \mLambda \mQ^{-1} \vx = \mA \vx \) spaces.)

Spectral Theorem

For the case of symmetric matrices, eigendecomposition is special.

If \( \mA = \mA^T \), a symmetric matrix, then \( \mQ^T = \mQ^{-1} \). So,

$$ \mA = \mQ \mLambda \mQ^T $$

In other words, the eigendecomposition of a symmetric matrix leads to an orthogonal matrix.
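For a symmetric matrix, `np.linalg.eigh` returns real eigenvalues and an orthogonal \( \mQ \) directly. A small check with an illustrative matrix:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])            # symmetric

eigvals, Q = np.linalg.eigh(A)
print(np.allclose(Q.T @ Q, np.eye(2)))              # True: eigenvectors are orthonormal
print(np.allclose(A, Q @ np.diag(eigvals) @ Q.T))   # True: A = Q Lambda Q^T
```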

In the next demo, a replica of the previous transformation recovery demo but with symmetry constraints on \( \mA \), check out the first step, \( \mQ^T \vx \). Since \( \mQ \) is orthogonal, and so is its inverse, there is no stretching or shrinking in the first step, only rotation. Any stretching or shrinking happens in the second step, \( \mLambda \mQ^T \vx \).

Check also that the eigenvectors (eig row 1 and eig row 2) are always orthonormal.

Spectral Theorem transformation recovery

(Interactive demo: drag the circle to change the associated row vector; panels show the input \( \vx \) space, the output \( \mA \vx \) space, and the intermediate \( \mQ^{T} \vx \), \( \mLambda \mQ^{T} \vx \), and \( \mQ \mLambda \mQ^{T} \vx = \mA \vx \) spaces.)

Singular value decomposition

Eigendecomposition is only defined for square matrices. For rectangular matrices, we turn to singular value decomposition.

Every real matrix \( \mA \in \real^{m \times n} \) can be factorized as follows

$$ \mA = \mU \mD \mV^T $$

Such formulation is known as the Singular value decomposition (SVD).

Here, the columns of \( \mU \) are known as the left-singular vectors of matrix \( \mA \). \( \mU \in \real^{m \times m} \) is an orthogonal matrix.

The columns of \( \mV \) are known as the right-singular vectors of the matrix \( \mA \). \( \mV \in \real^{n \times n} \) is an orthogonal matrix.

And \( \mD \in \real^{m \times n} \) is a diagonal matrix containing singular values of the matrix \( \mA \).

Note that \( \mU \) and \( \mV \) are square matrices. The diagonal matrix \( \mD \) is not square unless \( \mA \) is a square matrix.
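A minimal sketch of SVD on a rectangular matrix. Note that `np.linalg.svd` returns the singular values as a 1-D array, so the full \( m \times n \) matrix \( \mD \) has to be assembled explicitly if you want it:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])        # 2 x 3 rectangular matrix

U, s, Vt = np.linalg.svd(A)            # U: 2x2, s: singular values, Vt: 3x3 (this is V^T)

D = np.zeros(A.shape)
np.fill_diagonal(D, s)                 # place the singular values on the diagonal of a 2x3 matrix

print(np.allclose(A, U @ D @ Vt))      # True: A = U D V^T
print(np.allclose(U.T @ U, np.eye(2)), np.allclose(Vt @ Vt.T, np.eye(3)))  # both True
```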

The relationship between SVD and Eigendecomposition

In SVD, the roles played by \( \mU, \mD, \mV^T \) are similar to those of \( \mQ, \mLambda, \mQ^{-1} \) in eigendecomposition. But that similarity ends there.

Here's an important statement that people have trouble remembering. SVD of a square matrix may not be the same as its eigendecomposition.

In fact, the SVD and eigendecomposition of a square matrix coincide if and only if it is symmetric and positive definite (definiteness was defined in the section on matrix types). In that case,

$$ \mA = \mU \mD \mV^T = \mQ \mLambda \mQ^{-1} \implies \mU = \mV = \mQ \text{ and } \mD = \mLambda $$

In general though, the SVD and Eigendecomposition of a square matrix are different. The most important differences are listed below

  • Singular values are always non-negative, but eigenvalues can be negative
  • The matrices \( \mU \) and \( \mV \) in an SVD are always orthogonal, but the matrix \( \mQ \) in an eigendecomposition may not be. A symmetric matrix guarantees orthonormal eigenvectors; other square matrices do not.

For rectangular matrices, some interesting relationships hold.

Now, we know that for any rectangular matrix \( \mA \), the matrix \( \mA^T \mA \) is a square symmetric matrix. So, eigendecomposition is possible. Moreover, it has real eigenvalues and orthonormal eigenvectors

$$\begin{align} & \mA^T \mA = \mQ \mLambda \mQ^T \\ & \implies \left(\mU \mD \mV^T \right)^T \left(\mU \mD \mV^T\right) = \mQ \mLambda \mQ^T \\ & \implies \mV \mD^T \mU^T \mU \mD \mV^T = \mQ \mLambda \mQ^T \\ & \implies \mV \left(\mD^T \mD\right) \mV^T = \mQ \mLambda \mQ^T \\ \end{align}$$

Here, we have used the fact that \( \mU^T \mU = I \) since \( \mU \) is an orthogonal matrix.

Thus, the columns of \( \mV \) are actually the eigenvectors of \( \mA^T \mA \).

Moreover, the singular values along the diagonal of \( \mD \) are the square roots of the eigenvalues in \( \mLambda \) of \( \mA^T \mA \).

A similar analysis leads to the result that the columns of \( \mU \) are the eigenvectors of \( \mA \mA^T \).
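These relationships are easy to check numerically; a minimal sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
eigvals, Q = np.linalg.eigh(A.T @ A)     # eigendecomposition of the symmetric A^T A

# Eigenvalues of A^T A (sorted descending) equal the squared singular values.
print(np.allclose(np.sort(eigvals)[::-1], s ** 2))   # True
# (The columns of V match the eigenvectors in Q up to ordering and sign.)
```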

Understanding \( \mU, \mD, \text{ and } \mV \) in SVD

Now that we know that eigendecomposition is different from SVD, it is time to understand the individual components of the SVD.

We saw in an earlier interactive demo that orthogonal matrices rotate and reflect, but never stretch. So that's the role of \( \mU \) and \( \mV \), both orthogonal matrices.

But, \( \mU \in \real^{m \times m} \) and \( \mV \in \real^{n \times n} \). So they perform the rotation in different spaces.

Since \( \mU \) and \( \mV \) are strictly orthogonal matrices and only perform rotation or reflection, any stretching or shrinkage has to come from the diagonal matrix \( \mD \).

Hence, the diagonal elements of \( \mD \), the singular values, capture all the stretching and shrinking, and they are non-negative. Any dimensions with zero singular values are essentially squashed. Dimensions with higher singular values are more dominant (stretched) and, conversely, those with lower singular values are shrunk. And therein lies the importance of SVD.

By focusing on directions of larger singular values, one might ensure that the data, any resulting models, and analyses are about the dominant patterns in the data. This is achieved by sorting the singular values in magnitude and truncating the diagonal matrix to dominant singular values. That will entail corresponding adjustments to the \( \mU \) and \( \mV \) matrices by getting rid of the rows or columns that correspond to lower singular values.

So, if we are focused on the \( r \) top singular values, then we can construct an approximate or compressed version \( \mA_r \) of the original matrix \( \mA \) as follows:

$$ \mA_r = \mU_r \mD_r \mV_r^T $$

This is a great way of compressing a dataset while still retaining the dominant patterns within.

In fact, the number of non-zero or positive singular values of a matrix is equal to its rank.
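A minimal sketch of the rank-\( r \) truncation described above, with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)

r = 2                                              # keep only the top-2 singular values
A_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

print(np.linalg.matrix_rank(A_r))                  # 2
print(np.linalg.norm(A - A_r))                     # approximation error (drops as r grows)
```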

Machine learning is all about working with the generalizable and dominant patterns in data. Some details might be lost. In fact, in some cases, it is desirable to ignore irrelevant details to avoid the phenomenon of overfitting. And this is where SVD helps.

As a consequence, the SVD appears in numerous algorithms in machine learning. In the upcoming learning modules, we will highlight the importance of SVD for processing and analyzing datasets and models.

Other factorizations

Linear algebra is a vast subject and we have only scratched the surface. In this section, we list some additional factorizations that you might encounter in machine learning literature; a short code sketch follows the list.

  • LU Factorization is a subtle extension of the Gaussian elimination method that we studied earlier. Just as in Gaussian elimination for solving \( \mA \vx = \vy \), we factorize the coefficient matrix \( \mA \) such that \( \mA = \mathbf{L}\mathbf{U} \). Here, \( \mathbf{L} \) is a lower triangular matrix with 1's along the diagonal and \( \mathbf{U} \) is an upper triangular one. So, effectively, the Gaussian elimination step is equivalent to arriving at the matrix \( \mathbf{U} \). Any square matrix with a non-zero determinant has an LU factorization, possibly after reordering its rows (pivoting), which is effectively the same condition as for Gaussian elimination.
  • LDU Factorization is a further refinement of LU, in that \( \mA = \mathbf{L} \mD \mathbf{U} \). Here, \( \mathbf{U} \) also has 1's along the diagonal and the diagonal matrix \( \mD \) holds the pivots. When \( \mA \) is symmetric, \( \mathbf{U} = \mathbf{L}^T \), and the factorization is commonly called the LDL (or \( \mathbf{L}\mD\mathbf{L}^T \)) factorization.
  • Cholesky Factorization is a decomposition that applies to symmetric positive definite matrices. Any symmetric positive definite matrix \( \mA \in \real^{n \times n} \) can be factorized as \( \mA = \mathbf{G}\mathbf{G}^T \), where \( \mathbf{G} \) is a lower triangular matrix with positive diagonal entries.
  • QR Factorization: any rectangular matrix \( \mA \in \real^{m \times n} \) can be factorized as \( \mA = \mQ \mathbf{R} \), where \( \mQ \in \real^{m \times m} \) is an orthogonal matrix and \( \mathbf{R} \in \real^{m \times n} \) is an upper triangular matrix. (A short numpy/scipy sketch of some of these factorizations follows this list.)
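
Several of these factorizations are available out of the box in numpy and scipy. The following is a minimal sketch with small matrices chosen arbitrarily for illustration; note that scipy's LU routine pivots rows, so it actually returns a permutation matrix \( \mathbf{P} \) with \( \mA = \mathbf{P}\mathbf{L}\mathbf{U} \).

import numpy as np
from scipy.linalg import lu

# an arbitrary square matrix for illustration
A = np.asarray([[4.0, 3.0], [6.0, 3.0]])

# LU factorization with row pivoting: A = P L U
P, L, U = lu(A)

# full QR factorization: Q is a square orthogonal matrix, R is upper triangular
Q, R = np.linalg.qr(A, mode='complete')

# Cholesky factorization of a symmetric positive definite matrix: S = G G^T
S = np.asarray([[4.0, 2.0], [2.0, 3.0]])
G = np.linalg.cholesky(S)   # G is lower triangular with positive diagonal entries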

The good, the bad, and the ugly matrix

You will notice in most machine learning literature that the authors squeal with delight if they are working with certain matrices. Why? Because, as you have seen so far, certain operations, especially inversion, become possible, or at least easier, for some types of matrices.

First, let's talk about the ugliest of all: a rectangular, rank-deficient matrix. We cannot invert it, and eigendecomposition is not defined for it. The only thing we can definitely do is compute its singular value decomposition.

Among bad matrices, there are degrees of badness. The worst are those that are singular, even if they are square or symmetric. Better are non-singular matrices that are symmetric, or at least square: they are invertible and typically also admit an eigendecomposition.

The best matrices are the ones that are easier to work with. Identity, diagonal, orthogonal, and positive definite matrices are the best because of their rich properties and theory.

So, whenever you encounter a matrix, be on the lookout for some of these properties. If you are not dealing with a desirable one, it is often worth the extra effort to bring it closer to a better one.
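
As a rough sketch of what such a lookout might involve in numpy (the matrix and checks below are illustrative assumptions, not a complete taxonomy), one can test the key properties before deciding how to proceed:

import numpy as np

# an arbitrary square matrix to classify
A = np.asarray([[2.0, 1.0], [1.0, 2.0]])

# square and symmetric?
is_square = A.shape[0] == A.shape[1]
is_symmetric = is_square and np.allclose(A, A.T)

# invertible (non-singular)? equivalent to being full rank
is_invertible = is_square and np.linalg.matrix_rank(A) == A.shape[0]

# positive definite? (checked via eigenvalues; meaningful for symmetric matrices)
is_positive_definite = is_symmetric and bool(np.all(np.linalg.eigvalsh(A) > 0))

print(is_square, is_symmetric, is_invertible, is_positive_definite)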

Check out the next chart, which shows a rough matrix hierarchy identifying which ones are invertible, symmetric, and positive definite. It is not a Venn diagram in the true sense; it just shows containment. For example, all orthogonal matrices are invertible.

Matrix type relationships

Computational complexity

To implement computationally efficient machine learning code, you should be using matrix operations. In this section, we highlight the computational complexity of some important matrix operations to help you write faster code.

Note: Asymptotically faster algorithms exist for several of these operations, but it is safer to assume that the library you are using does not attain the best known theoretical bounds. A small timing illustration follows the list below.

Computational complexity of common matrix operations
  • Dot product of two vectors, \( (\va, \vb) \rightarrow \va \cdot \vb \): takes two \( n \)-element vectors, produces a scalar; worst-case complexity \( \BigO{n} \).
  • Matrix-vector multiplication, \( (\mA, \vx) \rightarrow \mA \vx \): takes an \( (m \times n) \) matrix and an \( n \)-element vector, produces an \( m \)-element vector; worst-case complexity \( \BigO{mn} \).
  • Matrix-matrix multiplication, \( (\mA, \mB) \rightarrow \mA \mB \): takes an \( (m \times n) \) matrix and an \( (n \times p) \) matrix, produces an \( (m \times p) \) matrix; worst-case complexity \( \BigO{mnp} \).
  • Determinant, \( \mA \rightarrow \text{det}(\mA) \): takes an \( (n \times n) \) square matrix, produces a scalar; worst-case complexity \( \BigO{n^{2.3}} \) to \( \BigO{n^4} \), depending on the algorithm.
  • Matrix inversion, \( \mA \rightarrow \mA^{-1} \): takes an \( (n \times n) \) square matrix, produces an \( (n \times n) \) square matrix; worst-case complexity \( \BigO{n^3} \).
  • Eigendecomposition, \( \mA \rightarrow \mQ \mLambda \mQ^{-1} \): takes an \( (n \times n) \) square matrix, produces three \( (n \times n) \) square matrices \( \mQ, \mLambda, \mQ^{-1} \); worst-case complexity \( \BigO{n^3} \).
  • Singular value decomposition, \( \mA \rightarrow \mU \mD \mV^T \): takes an \( (m \times n) \) matrix, produces an \( (m \times m) \) matrix \( \mU \), an \( (m \times n) \) matrix \( \mD \), and an \( (n \times n) \) matrix \( \mV \); worst-case complexity \( \BigO{mn^2} \) for \( m \ge n \).
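
To see why this matters in practice, here is a small, illustrative timing comparison (sizes chosen arbitrarily) between a hand-written matrix-vector product and numpy's vectorized equivalent; both perform the same \( \BigO{mn} \) work, but the vectorized call is typically orders of magnitude faster.

import timeit
import numpy as np

n = 500
A = np.random.rand(n, n)
x = np.random.rand(n)

# naive matrix-vector product using explicit Python loops
def matvec_loop(A, x):
    y = np.zeros(A.shape[0])
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            y[i] += A[i, j] * x[j]
    return y

# time both versions on the same inputs
print(timeit.timeit(lambda: matvec_loop(A, x), number=3))
print(timeit.timeit(lambda: A @ x, number=3))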

Linear algebra in Python

Python is currently the most popular choice for implementing machine learning algorithms. Major libraries such as TensorFlow, PyTorch, and scikit-learn either provide Python interfaces or are themselves developed in Python.

To leverage linear algebra routines in Python, the most comprehensive package, and possibly the de facto standard, is numpy.

Listed below are some examples of linear algebra operations in numpy.

# in the following examples, the values have been chosen arbitrarily to illustrate the concepts

# load the numpy library
import numpy as np

# create 2-dimensional vectors from python lists
a = np.asarray([1, 2])
b = np.asarray([3, 4])


# compute l2, l1, or max-norm of a vector
l2norm = np.linalg.norm(a, ord=2)
l1norm = np.linalg.norm(a, ord=1)
maxnorm = np.linalg.norm(a, ord=np.inf)

# scale, shrink, or flip a vector
scaleda = 2*a
shrunka = 0.5*a
flippeda = -1.0*a

# linear combination of two vectors a and b
lincomb = 2*a + 3*b

# dot product of two vectors, a and b
dotprod = np.dot(a,b)

# angle (in radians) between two non-zero vectors
cosangle = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
angle = np.arccos(cosangle)

# projection of a vector "b" onto vector "a"
bona = (np.dot(b, a) / np.dot(a, a)) * a

# create a matrix from python list of lists
A = np.asarray([[1,2],[3,4]])

# solve the linear system Ax = b for x
x = np.linalg.solve(A,b)

# find rank of a matrix "A"
r = np.linalg.matrix_rank(A)

# get the upper triangular matrix from a full matrix "A"
uA = np.triu(A)

# get the lower triangular matrix from a full matrix "A"
lA = np.tril(A)

# matrix multiplication of a matrix with a vector
Ax = np.dot(A,x)

# find inverse of a matrix
Ainv = np.linalg.inv(A)

# find determinant of a matrix
Adet = np.linalg.det(A)

# find eigendecomposition of a matrix. d is an array of eigenvalues and the columns
# of Q are the corresponding eigenvectors (orthogonal when A is symmetric)
d, Q = np.linalg.eig(A)

# find SVD of a matrix. note that numpy returns the third factor already transposed
U, s, Vt = np.linalg.svd(A)


# test if the matrix A is positive definite
# (for a symmetric A, this is equivalent to all eigenvalues being positive)
if np.all(np.linalg.eigvals(A) > 0):
  print("Positive definite")
else:
  print("Not positive definite")

# test if the matrix A is positive semi-definite (this includes positive definite)
if np.all(np.linalg.eigvals(A) >= 0):
  print("Positive semi-definite")
else:
  print("Not positive semi-definite")

Dealing with sparsity

Most datasets and models that we deal with in machine learning are either sparse by nature or constrained to be so. Numpy provides data structures and algebraic routines for dense (fully populated) matrices only. The support for sparse data-structures and routines is provided by scipy.

Sparse matrices are represented internally in many different forms. To write efficient code, it is essential to use the correct form. A full discussion is beyond our scope, but here are some general guidelines.

  • CSC (Compressed sparse column) and CSR (Compressed sparse row) formats are the most compact representations. As their names suggest, CSC is an array of columns and is more efficient at column operations than other representations. Conversely, CSR, being an array of rows, is more efficient at row operations.
  • COO (Coordinate) and DOK (Dictionary of keys) representations are usually faster to initialize from tabular data than CSC or CSR, but are less efficient at certain operations (a small sketch follows this list).
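
For example, a common pattern is to build a matrix incrementally in DOK format and then convert it to CSR for fast arithmetic; a minimal sketch with arbitrary values:

import scipy.sparse as sp

# build a 3 x 3 matrix incrementally in DOK format (efficient for element-wise insertion)
D = sp.dok_matrix((3, 3))
D[0, 1] = 2.0
D[2, 0] = 5.0

# convert to CSR for fast row operations and matrix arithmetic
C = D.tocsr()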

Here are some operations for dealing with sparse matrices in Python using scipy. Note that many numpy routines do not work directly on sparse matrices, so pay attention to which package is being invoked.



# load the scipy sparse packages and the numpy package
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# create an empty sparse matrix in COO format of size m times n
m, n = 3, 3
A = sp.coo_matrix((m, n))

# create a sparse CSR matrix from available python lists
A = sp.csr_matrix([[1.0, 0, 2], [0, 3, 3], [2, 3, 0]])

# create sparse row vectors "a" and "b" in CSR format
a = sp.csr_matrix([[1.0, 0, 2]])
b = sp.csr_matrix([[0, 3.0, 3]])

# compute l2, l1, or max-norm of a sparse vector
# (numpy's norm routines expect dense arrays, so densify the vector first)
adense = a.toarray().ravel()
l2norm = np.linalg.norm(adense, ord=2)
l1norm = np.linalg.norm(adense, ord=1)
maxnorm = np.linalg.norm(adense, ord=np.inf)

# scale, shrink, or flip a sparse vector
scaleda = 2 * a
shrunka = 0.5 * a
flippeda = -1.0 * a

# linear combination of two sparse vectors a and b
lincomb = 2 * a + 3 * b

# dot product of two sparse vectors, a and b (element-wise product, then sum)
dotprod = a.multiply(b).sum()

# solve Ax = y with a sparse solver; A should be in CSR or CSC format
y = np.asarray([1.0, 2.0, 3.0])
x = spla.spsolve(A, y)

# get the upper triangular part of a sparse matrix "A"
uA = sp.triu(A)

# get the lower triangular part of a sparse matrix "A"
lA = sp.tril(A)

# matrix multiplication of a sparse matrix with a vector
Ax = A.dot(x)

# find the inverse of a sparse matrix. requires CSC or CSR format
Ainv = spla.inv(sp.csc_matrix(A))

# find the determinant of a matrix. scipy has no sparse determinant routine, so densify first
Adet = np.linalg.det(A.toarray())

# find a few (k < n) eigenvalues and eigenvectors of a sparse symmetric matrix
d, Q = spla.eigsh(A, k=2)

# find a truncated SVD (k < min(m, n) singular values) of a sparse matrix
U, s, Vt = spla.svds(A, k=2)


Further reading and resources

Here are some additional resources on linear algebra to supplement the material presented here.

Next article: Multivariate Calculus

With a sound understanding of Linear Algebra, we continue our journey to mathematical expertise with a deep dive into Calculus for Machine Learning.

Supplemented with numerous interactive demos, the article is self-contained and builds intuition about everything you need to know about Multivariate Calculus to take on important challenges such as Optimization and Inference.

Review: Self-assessment

  • Do you know the difference between scalars, vectors, and matrices?
  • How can you find the magnitude and direction of a vector?
  • What are \(L_1, L_2, L_\infty\), and in general, \( L_p \) norms? What are Euclidean, Manhattan, Taxi-cab, and Max-norms?
  • What is a transpose operation?
  • What are the span, nullspace, and range of a group of vectors?
  • What is vector dot-product?
  • What is linear independence of vectors?
  • What matrices are called singular matrices?
  • What are orthogonal vectors?
  • How do you compute the angle between two vectors?
  • How do you represent a system of linear equations as matrices?
  • What is matrix rank? What matrices are rank-deficient? Which are full-rank?
  • What is the fundamental theorem of linear algebra?
  • What are the rank requirements to be able to solve a group of linear equations represented as matrices?
  • What are square, symmetric, identity, and orthogonal matrices?
  • What are positive definite and positive semi-definite matrices?
  • Do you understand the transformative effects of various types of matrices?
  • What is a matrix inverse?
  • What matrices are invertible? What are not?
  • What are eigenvalues and eigenvectors?
  • How many eigenvalues and eigenvectors can a matrix have?
  • What do you know about the eigenvalues and eigenvectors of symmetric matrices?
  • What do you know about the eigenvalues and eigenvectors of positive definite and positive semi-definite matrices?
  • What is eigendecomposition? Which matrices can be eigendecomposed?
  • What is singular value decomposition? Which matrices have an SVD?
  • What is the relationship between SVD and eigendecomposition?
  • Can you determine if a given matrix is positive definite or positive semi-definite?