Machine Learning & Linear Algebra — Special Matrices

9 min readMar 1, 2019

In literature, we often deal with statements like “since the matrix is symmetric, it has orthonormal eigenvectors”. It sounds like you should know them by heart. In linear algebra, there are special matrices with properties that are easy to analyze and to manipulate. They may have specific eigenvalues or special relationship among their eigenvectors. They obey rules that make operations easy. And there are methods to factorize a matrix into these “easier” matrices.

The reduction in operation complexity improves scalability. However, even these matrices are special, they are no way uncommon. In machine learning and many applications, we deal with them all the time. In this article, we will introduce them so you feel less intimidated when you heard “positive definite matrix has all positive eigenvalues”.

Diagonal matrix

A diagonal matrix S has all non-diagonal elements equal zero.

Many factorization methods have one of the decomposed matrices to be a diagonal matrix. Since the matrix contains diagonal elements only, we sometimes write it in term of a vector.

The inverse of a generic matrix is not easy to calculate. But finding the inverse of a diagonal matrix is easy. We can replace the diagonal element with 1/m.

Also, matrix multiplication is much simpler if one of the matrices is diagonal. But when any diagonal element equals zero or the diagonal matrix is not square, its inverse does not exist. But yet, pseudoinverse (keep the inverse of 0 as 0) can be used as a substitute in some methods.

Orthogonal matrix

An orthogonal matrix Q is a square matrix that fulfills the following requirement.

All the columns (v₁, …, vᵢ , …) in Q are orthonormal, i.e. vᵢᵀ vⱼ =0 for i≠j and vᵢ are all unit vectors.

This sounds like a tough requirement but for some matrix, like a symmetric matrix, we can choose our eigenvectors to be orthonormal during factorization.

The following matrix is orthogonal.

And

Like a diagonal matrix, its inverse is very easy to compute — the inverse of an orthogonal matrix is its transpose. This is one key reason why orthogonal matrices are so handy.

Proof:

If we multiply x with an orthogonal matrix, the errors present in x will not be magnified. This behavior is very desirable for maintaining numerical stability.

The relation AAᵀ=I simplifies many computations, like projections. For example, we can start with Ax = b.

And the projection vector can be simplified as

where qᵢ is column i of Q.

Symmetric matrix

A matrix is symmetric if its transpose equals itself.

For example,

Symmetric matrices are one of the most important matrices in linear algebra and machine learning. In machine learning (ML), we often use matrices to hold f(vᵢ , vⱼ). Such functions are often symmetrical, f(x, y) = f(y, x), and the corresponding matrix is therefore symmetric. For instance, in machine earning, f can measure the feature distance between data points or calculates the covariance of features.

Symmetric Matrix Properties

A symmetric matrix S is an n × n square matrices.

Its inverse is also symmetrical.
All eigenvalues of S are real (not a complex number).
We can choose n eigenvectors of S to be orthonormal even with repeated eigenvalues.
AᵀA and AAᵀ are symmetrical (usually AᵀA ≠ AAᵀ). In machine learning, the covariance matrix with zero-centered data is in this form.

AᵀA is invertible if columns of A are linearly independent.
Every symmetric matrix S can be diagonalized (factorized) with Q formed by the orthonormal eigenvectors vᵢ of S and Λ is a diagonal matrix holding all the eigenvalues.

The equation above can be rewritten as

where v are unit vectors. Therefore the eigenvalue term λᵢ dominates the importance of each term above. In fact, if it is too small, we can drop the corresponding term λᵢvᵢvᵢᵀ completely.

This factorization property and “S has n orthogonal eigenvectors” are two important properties for a symmetric matrix.

Orthonormal eigenvectors

Eigenvectors are not unique. But often, we can “choose” a set of eigenvectors to meet some specific conditions. As mentioned before, the eigenvectors of a symmetric matrix can be chosen to be orthonormal. If S is a symmetric matrix, its eigenvalue λ and μ fulfill the following condition.

Proof

From this condition, if λ and μ have different values, the equivalency force the inner product to be zero. Therefore, x and y are orthogonal and it is easy to normalize them to have unit length — orthonormal. This proves that we can choose eigenvectors of S to be orthogonal if at least their corresponding eigenvalues are different. But even with repeated eigenvalue, this is still true for a symmetric matrix.

Proof — part 2 (optional)

For an n × n symmetric matrix, we can always find n independent orthonormal eigenvectors. The largest eigenvalue is

To find the maximum, we set the derivative of r(x) to 0. After some manipulation, it can be shown that

i.e. r(x) achieves the highest ratio if x is an eigenvector with the largest eigenvalues. By induction, we can deduce that we can find the next highest eigenvalue with an eigenvector orthogonal to the previous one. This is only a high-level description of the proof. For details, please refer to section 7.2 in Gilbert Strang’s book (see reference at the end).

Spectral theorem

Let’s summarize. Every n × n symmetric matrix S has n real eigenvalues λᵢ with n chosen orthonormal eigenvectors vᵢ.

These eigenvalues can form a diagonal matrix Λ as diag(λ). We can also concatenate eigenvectors vᵢ to form V. i.e.,

We rename V as Q. Because Q is orthogonal, it is invertible and Qᵀ = Q⁻ ¹. Therefore, the symmetric matrix S can be factorized into

This is the Spectral theorem. Because finding transpose is much easier than the inverse, a symmetric matrix is very desirable in linear algebra.

Positive Definite Matrix

Positive definite matrix has all positive eigenvalues. If a matrix is invertible, its determinant is non-zero. As the determinant equals the products of all of the eigenvalues, a positive definite matrix’s determinant is positive. Therefore, it is invertible.

The term below computes the energy of a system with state x. If M is positive definite, it guarantees the energy stays positive unless x is zero.

So if energy is positive, the corresponding M should be positive definite. It sounds unusual but many matrices in real-life problems are positive definite.

There are many equivalent conditions to test positive definiteness. A matrix M is positive definite if any of the tests below are true:

All eigenvalues > 0,

2. All upper left determinants > 0,

3. All pivots > 0,

4. Energy > 0 except at x = 0,

5. M can be constructed by AᵀA and columns of A are linearly independent.

Verifying all eigenvalues are positive takes a lot of works. Therefore, conditions 2 or 3 are a more common test. For example, positive pivots mean positive eigenvalues (or vice versa). On the other hand, if we prove a matrix is positive definite with one of the tests above, we guarantee that it owns all the properties above.

Besides positive definite, we also have positive semidefinite, negative definite, and negative semidefinite. Positive semidefinite replace all the “>” conditions above with “≥”. For example, its eigenvalues are greater or equal to 0. Negative definite and negative semidefinite are the opposite of positive definite and positive semidefinite.

Proof

In this section, we will prove some of the properties above. If M is positive definite, all λ is positive. Therefore, the computed energy of the corresponding state x is positive (except x = 0).

If M is composed of AᵀA, M is positive semi-definite under the energy test.

AAᵀ and AᵀA are positive semi-definite. If the columns in A are linear independent, they are positive defininte.

Minimum

In calculus, we set the first-order derivative of f to zero to find its critical point. However, such a point can be a maximum, a minimum, or a saddle point. Many machine learning model expresses the cost function in the quadratic form xAᵀx. Knowing whether this function is a convex function is important. Since if it is convex, we know that the local minimum serves as the global minimum also. If A is positive definite, the quadratic function is convex.

For any function f, we compute the Hessian matrix below. If A is positive definite, the corresponding point is a local minimum.

Covariance matrix

In machine learning, we are very interested in finding the correlation between properties. The figure below demonstrates a positive relationship between weight and height.