Kernels for Machine Learning

Jonathan Hui
10 min read · May 27, 2024

--

In many machine learning problems, input data is transformed into a higher-dimensional feature space using a non-linear mapping to make it easier to find patterns or separate classes. Subsequently, linear models can project this feature space into a one-dimensional space for classification. For example, in the left diagram below, the red and blue dots cannot be separated by a straight line in their original two-dimensional space. However, when transformed into a three-dimensional space using a non-linear function, a plane can successfully separate the two classes.

Projection onto Higher Dimensional Space (Image by Author)

Motivation

Kernel functions offer a principled approach to detecting nonlinear relationships using linear algorithms. This is achieved by implicitly mapping the original input data into a feature space where the data might become linearly separable. For example, with the mapping 𝜙(x) = x₁² + x₂², the red and blue dots in the diagram above are now linearly separable in this newly transformed feature space, as opposed to their original input space.
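To make this concrete, here is a tiny NumPy sketch (synthetic data of my own construction, not the article's figure) showing that the single feature x₁² + x₂² separates an inner cluster from an outer ring with a simple threshold:

import numpy as np

rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
# Inner cluster (class 0) and outer ring (class 1) in 2D
inner = np.c_[0.5 * np.cos(theta[:100]), 0.5 * np.sin(theta[:100])]
outer = np.c_[2.0 * np.cos(theta[100:]), 2.0 * np.sin(theta[100:])]

# Non-linear feature: phi(x) = x1^2 + x2^2 (squared radius)
z_inner = (inner ** 2).sum(axis=1)
z_outer = (outer ** 2).sum(axis=1)

# A single threshold on this feature separates the classes perfectly
print(z_inner.max() < 1.0 < z_outer.min())  # True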

However, in real-world problems, the feature space often needs to be high-dimensional to untangle complex data, and the mapping ϕ is usually unknown. Fortunately, we can use machine learning or deep learning techniques to model ϕ. Once modeled, a linear algorithm such as Support Vector Machines (SVM) can be applied in the feature space. In the example below, a 2D input is transformed into a pre-assumed 9D feature space using a third-degree polynomial. Then, inferences are made with the linear model θ.
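As a quick illustration of that 9D expansion (using scikit-learn purely for convenience; the exact ordering of the monomials is the library's choice):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([[2.0, 3.0]])                         # a single 2D input
poly = PolynomialFeatures(degree=3, include_bias=False)
features = poly.fit_transform(x)                   # all monomials up to degree 3

print(features.shape)                              # (1, 9)
print(poly.get_feature_names_out(["x1", "x2"]))    # e.g. x1, x2, x1^2, x1 x2, x2^2, x1^3, ...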

However, when dealing with a high-dimensional feature space, the transformation cost becomes significant. In particular, 𝜙 can, and often does, map into an infinite-dimensional feature space, and computing an infinite number of components is infeasible. What makes kernel methods stand out is that they handle high-dimensional and even infinite-dimensional features efficiently through the kernel trick.

Kernel Trick

Let’s demonstrate the method the hard way first, before introducing the kernel trick to make it more efficient. In kernel methods, we often utilize the similarity between an input data point and the training dataset to make predictions. A common approach involves calculating the inner product of two points after they have been transformed into a higher-dimensional feature space.

In the example below, φ(x) represents a mapping function that transforms an input vector x = (x₁, x₂) into a feature space defined by a second-degree polynomial:
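One standard choice for such a mapping (assumed here for concreteness; any mapping whose inner product equals (1 + xᵀy)² gives the same numbers below) is:

𝜙(x) = (1, x₁², x₂², √2·x₁, √2·x₂, √2·x₁x₂)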

Given two data points x = (1, 2) and y = (3, 4), the inner product in the basis (1, x₁, x₂) is ⟨(1, 1, 2), (1, 3, 4)⟩ = 1 + 3 + 8 = 12.

Let’s compute the inner product in the feature space. The result is 144.

In this process, the feature mapping 𝜙 transforms the input vectors into a higher-dimensional space. Subsequently, the inner product reduces the result back to ℝ. Since the final result is a scalar in ℝ, we might question whether these transformations are necessary. In other words, can we find an equivalent function that directly computes the result without explicitly calculating 𝜙(x) and 𝜙(y)?

In our example, we observe that the final result (144) is simply the square of the inner product of the original input vectors (12² = 144). It can be shown through algebraic expansion that these two calculations are indeed equivalent.

In this example, the kernel function k is defined as the square of the inner product between the augmented vectors (1, x₁, x₂) and (1, y₁, y₂).
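A few lines of NumPy confirm that the two routes agree, assuming the √2-weighted mapping shown earlier:

import numpy as np

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

def phi(v):
    # Second-degree polynomial feature map with sqrt(2) weights
    v1, v2 = v
    return np.array([1.0, v1**2, v2**2,
                     np.sqrt(2) * v1, np.sqrt(2) * v2, np.sqrt(2) * v1 * v2])

explicit = phi(x) @ phi(y)      # inner product in the 6D feature space
trick = (1.0 + x @ y) ** 2      # kernel trick in the original 2D space

print(explicit, trick)          # both 144.0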

This is the essence of the kernel trick: the inner product of two vectors in a high-dimensional (or even infinite-dimensional) feature space can be computed efficiently using a kernel function in the original input space, bypassing the need for explicit feature mapping. This efficiency becomes particularly significant in infinite-dimensional feature spaces, where direct computation of the inner product would be impossible, but the kernel trick remains feasible.

Can this kernel trick be generalized? Indeed, for any ϕ, there exists a corresponding kernel k. For example, to raise an input space to a higher-degree polynomial, the corresponding kernel is simply:
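k(x, y) = (1 + xᵀy)ᵈ for a degree-d polynomial feature space (the constant 1 can be replaced by any c ≥ 0).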

However, our primary objective is not to explicitly find the feature space, which is generally unknown. Instead, we focus on the result, k(x, y), not the feature mapping 𝜙(x). In fact, we often design kernels directly without explicitly constructing or even knowing the corresponding feature space.

In fact, it is often more intuitive to think in terms of kernels than feature spaces. For instance, if the similarity between neighboring points can be modeled with a Gaussian distribution, a Gaussian kernel would be a natural choice. As shown later, this kernel corresponds to a feature space with infinite dimensions, theoretically allowing for arbitrarily complex decision boundaries.

If we can find a kernel that effectively solves machine learning problems, that is sufficient for our purposes. Understanding the feature space is a bonus, but it is not necessary for kernel design.

Announcement

Generative AI has been an exciting area of development in recent years, sparking my interest in writing a book on the topic. This project has been both immensely challenging and rewarding. During the writing process, I discovered many captivating subjects that didn’t quite fit the book’s main focus but still held valuable insights. These topics may be a bit raw due to time constraints, but I believe they’re worth exploring. Instead of discarding them, I’ve decided to share them through a series of articles on Medium.

While the book isn’t finished yet, you can follow its progress and be among the first to know when it’s complete by connecting with me on Medium or LinkedIn. Stay tuned for the official launch announcement!

Kernel

Let’s give a formal definition of a kernel.

Definition: Let X be a non-empty set.
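A function k: X × X → ℝ is a kernel if there exists a Hilbert space H and a feature map 𝜙: X → H such that, for all x, x′ ∈ X, k(x, x′) = ⟨𝜙(x), 𝜙(x′)⟩.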

For every valid kernel function, there exists at least one corresponding feature space in which the kernel can be represented as an inner product.

Radial Basis Function Kernel (RBF kernel)

To build intuition, let’s examine the feature space associated with the RBF kernel, which is defined as:
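k(x, x′) = exp(−‖x − x′‖² / (2σ²)), where σ > 0 controls the width of the kernel (equivalently parameterized with γ = 1/(2σ²)).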

The RBF kernel implicitly maps data into an infinite-dimensional feature space. To analyze it, we can utilize the Taylor expansion of the exponential function. The Taylor expansions of exp(x) and exp(xᵀx′) are:
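exp(x) = Σₙ xⁿ / n! = 1 + x + x²/2! + x³/3! + …

exp(xᵀx′) = Σₙ (xᵀx′)ⁿ / n!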

Using the last Taylor expansion, the RBF kernel becomes
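k(x, x′) = exp(−‖x‖²/(2σ²)) · exp(−‖x′‖²/(2σ²)) · exp(xᵀx′/σ²)

= exp(−‖x‖²/(2σ²)) · exp(−‖x′‖²/(2σ²)) · Σₙ (xᵀx′)ⁿ / (σ²ⁿ n!)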

This expansion shows that the RBF kernel corresponds to an infinite-dimensional feature space 𝜙, sketched below. This infinite-dimensional nature allows the RBF kernel to model highly complex decision boundaries.
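For scalar inputs with σ = 1 (a simplifying assumption for illustration; the multivariate case is analogous), one explicit form of this feature map is:

𝜙(x) = exp(−x²/2) · (1, x/√1!, x²/√2!, x³/√3!, …)

so that ⟨𝜙(x), 𝜙(x′)⟩ = exp(−x²/2) · exp(−x′²/2) · Σₙ (xx′)ⁿ/n! = exp(−(x − x′)²/2).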

Positive Definite Functions

Not all functions can be valid kernels. They must be positive semidefinite.

Definition

A function k: X × X → ℝ is positive semidefinite if, for any n, any coefficients a₁, …, aₙ ∈ ℝ, and any points x₁, …, xₙ ∈ X, it fulfills the condition:
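Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) ≥ 0, where the sums run over i, j = 1, …, n.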

In simpler terms, this double sum is never negative, regardless of the choice of coefficients and points. If it equals zero only when all aᵢ = 0, the function is called positive definite.

Kernels are Positive Semidefinite

A continuous or finite-domain function k can be represented as the inner product of feature mappings in a Hilbert space H if and only if k is a symmetric, positive semidefinite function.

As shown below, if a function k can be represented as an inner product in a feature space, then k is positive semidefinite.
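Σᵢ Σⱼ aᵢ aⱼ k(xᵢ, xⱼ) = Σᵢ Σⱼ aᵢ aⱼ ⟨𝜙(xᵢ), 𝜙(xⱼ)⟩ = ⟨Σᵢ aᵢ 𝜙(xᵢ), Σⱼ aⱼ 𝜙(xⱼ)⟩ = ‖Σᵢ aᵢ 𝜙(xᵢ)‖² ≥ 0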

Furthermore, since the inner product is symmetric, any kernel function derived from an inner product must also be symmetric (k(x, y) = k(y, x)). This requirement for symmetry, combined with the requirement for positive semidefiniteness, establishes a theoretical framework for constructing and validating new kernel functions.

Kernel Matrix

Equivalently, for a function to be a valid kernel, the associated kernel matrix (the matrix of pairwise kernel values Kᵢⱼ = k(xᵢ, xⱼ)) must be positive semidefinite.

Definition: An n×n matrix K is positive semidefinite if
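aᵀ K a = Σᵢ Σⱼ aᵢ aⱼ Kᵢⱼ ≥ 0 for every vector a ∈ ℝⁿ.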

Mercer’s Theorem

We showed above that kernel functions are positive semidefinite. Mercer’s theorem states that the converse is also true: any symmetric, positive semidefinite function can be represented as the inner product of vectors in a corresponding feature space.

Ridge Regression

In this section, we will use Ridge Regression to demonstrate how to perform inference with a kernel function. Ridge Regression is a linear regression method that minimizes the mean squared error of the predictions while incorporating an L2 regularization term to prevent overfitting. The prediction for a data point x is θᵀx.
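Concretely, writing X for the n×d matrix whose rows are the training inputs and y for the target vector (notation assumed here), the objective is, up to a constant factor on the error term:

J(θ) = ‖y − Xθ‖² + λ‖θ‖²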

To find the optimal θ, we set the gradient of the objective function with respect to θ to zero and solve for θ. This yields the following closed-form solution:
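θ = (XᵀX + λI)⁻¹ Xᵀ y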

We can equivalently express θ in the form Xᵀα, where α is:
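α = (XXᵀ + λI)⁻¹ y = (G + λI)⁻¹ y, which follows from the identity (XᵀX + λI)⁻¹ Xᵀ = Xᵀ (XXᵀ + λI)⁻¹.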

G is called the Gram matrix, which is computed as XXᵀ and is therefore symmetric.

Given a new input x, a prediction can be made using f(x) = θᵀx = ⟨θ, x⟩ = ⟨Xᵀα, x⟩ = Σᵢ αᵢ ⟨xᵢ, x⟩.

This solution can be recognized as a linear combination of the inner products between the training data points and the new example 𝑥.

The coefficients α are derived from the Gram matrix G. This implies that we do not directly model the parameters θ. Instead, we utilize the training data itself to make predictions. This approach is a defining characteristic of non-parametric machine learning models.

Finally, we can make one more improvement: instead of using x as the input, we use its feature space representation ϕ(x). This applies to both the inner product and the Gram matrix. Here, the matrix G is also referred to as the Kernel Matrix.

The new prediction will be a linear combination of the inner products between the features of the training data points and the features of the new example x.
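Putting the pieces together, here is a minimal NumPy sketch of kernelized Ridge Regression (my own illustration, assuming an RBF kernel and the notation used above):

import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian kernel
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit(X_train, y_train, lam=0.1, sigma=1.0):
    # alpha = (G + lambda * I)^-1 y, with G the kernel (Gram) matrix
    G = rbf_kernel(X_train, X_train, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X_train)), y_train)

def predict(X_new, X_train, alpha, sigma=1.0):
    # f(x) = sum_i alpha_i * k(x_i, x)
    return rbf_kernel(X_new, X_train, sigma) @ alpha

# Toy usage: fit a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit(X, y)
print(predict(np.array([[0.5]]), X, alpha))   # close to sin(0.5) ≈ 0.48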

Kernel Design

We can combine kernels to design new ones. For example, adding two valid kernels together or multiplying a kernel by a positive scalar results in another valid kernel. Furthermore, multiplying two valid kernels together also forms a new valid kernel.

Here is a list of common kernels.
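Linear kernel: k(x, y) = xᵀy

Polynomial kernel: k(x, y) = (xᵀy + c)ᵈ, with c ≥ 0

Gaussian (RBF) kernel: k(x, y) = exp(−‖x − y‖² / (2σ²))

Laplacian kernel: k(x, y) = exp(−‖x − y‖₁ / σ)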

Let’s prove that the Gaussian kernel is a valid kernel by expressing it as the product of three exponential functions.
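exp(−‖x − y‖² / (2σ²)) = exp(−‖x‖² / (2σ²)) · exp(−‖y‖² / (2σ²)) · exp(xᵀy / σ²)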

According to the rule
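k(x, y) = f(x)·f(y) (a valid kernel for any real-valued function f, one of the standard closure rules),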

the first two terms form a kernel, with f(x) = exp(−‖x‖² / (2σ²)).

The third term, exp(xᵀy / σ²), is also a valid kernel. To see this, we note that the inner term, xᵀy / σ², is itself a kernel: it is a scaled sum of products of the form f(x)·f(y), one per coordinate.

Using the Taylor series expansion, we can express this exponential as an infinite sum of powers of that inner kernel. Since both the sum and the product of valid kernels are also valid kernels, we can conclude that the third term is indeed a valid kernel.

By the product rule again (the product of two valid kernels is a valid kernel), we can conclude that the Gaussian kernel is itself a valid kernel.
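As a quick numerical sanity check (not a proof), we can verify that the eigenvalues of a Gaussian kernel matrix built from random points are non-negative:

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))                           # 30 random 2D points

# Gaussian kernel matrix with sigma = 1
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-d2 / 2.0)

eigvals = np.linalg.eigvalsh(K)                        # eigenvalues of a symmetric matrix
print(eigvals.min() >= -1e-10)                         # True: K is positive semidefinite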
