Machine Learning — Visualization, Dimension Reduction & Anomaly Detection

Jonathan Hui
4 min read · Aug 9, 2019

Visualization & dimension reduction

Multi-dimensional scaling (MDS)

MDS visualizes data in a low-dimensional space. PCA extracts independent features that are orthogonal to each other. While this is good for model training, it often loses the structure of the data and makes visualization hard. MDS instead focuses on creating a mapping that preserves the relative distances between data points: if two points are close in the feature space, they should be close in the latent factor space.

Source: Wikipedia

We create a cost function that maintains the relative distances between points.
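
A generic form of such a cost function, sketched here with d₁ as the distance in the input space, d₂ as the distance in the low-dimensional space, and d₃ as the penalty on their mismatch (the same d₁, d₂, d₃ referenced in the t-SNE section below), is

\text{cost} = \sum_{i<j} d_3\big( d_1(x_i, x_j),\; d_2(z_i, z_j) \big)

where xᵢ are the original data points and zᵢ their low-dimensional embeddings. Classical metric MDS takes d₁ and d₂ to be Euclidean distances and d₃ to be a squared difference.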

Sammon mapping

In Sammon mapping, the loss function is re-weighted by the distances between the input features, so small-scale structure is not drowned out and we do not miss the fine details.

Source
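
For reference, the standard Sammon stress (sketched here) divides each squared error by the original pairwise distance, so that small distances carry relatively more weight:

E = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\big(d_{ij} - \hat{d}_{ij}\big)^2}{d_{ij}}

where d_{ij} is the distance between points i and j in the input space and \hat{d}_{ij} is the distance in the low-dimensional embedding.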

IsoMap

Data points may lie on a manifold, like the Swiss roll below. We can discover the structure and cluster the data points better by following the manifold instead.

Source

First, we need to redefine the distance between two points. The geodesic distance is the distance measured along the manifold. So even though the Euclidean distance between points A and R is smaller than that between A and C below, AC has a much shorter geodesic distance: AC has a geodesic distance of 2 while AR has a geodesic distance of 17.

Left diagram source

Here is the high-level IsoMap algorithm (a rough code sketch follows the steps):

  • Find the close neighbors around each point (points within a fixed radius or K nearest neighbors).
  • Construct a graph connecting neighboring nodes, with edge lengths equal to the Euclidean distances.
  • Compute the shortest path between every pair of nodes using a shortest-path algorithm; this gives the geodesic distance.
  • Run MDS using the computed geodesic distance.
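
As a rough sketch of these steps (not a reference implementation; the dataset, neighbor count, and parameters below are just illustrative), using scikit-learn for the k-NN graph and MDS and SciPy for the shortest paths:

```python
# A minimal IsoMap sketch: k-NN graph -> geodesic distances -> MDS.
# Assumes the k-NN graph is connected (otherwise some geodesic distances are infinite).
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path
from sklearn.manifold import MDS

X, _ = make_swiss_roll(n_samples=800, noise=0.05, random_state=0)

# 1. Connect each point to its k nearest neighbors, weighted by Euclidean distance.
knn = kneighbors_graph(X, n_neighbors=10, mode="distance")

# 2. Geodesic distance = shortest path along the neighborhood graph.
geodesic = shortest_path(knn, method="D", directed=False)

# 3. Run metric MDS on the geodesic distance matrix.
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
embedding = mds.fit_transform(geodesic)
print(embedding.shape)  # (800, 2)
```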

t-SNE (t-Distributed Stochastic Neighbor Embedding)

t-SNE is an MDS with special choices for d1, d2, and d3: roughly, similarities in the original space are modeled with a Gaussian, similarities in the low-dimensional embedding with a heavier-tailed Student's t-distribution, and the mismatch between the two is measured with a KL divergence.

t-SNE tends to produce clearer gaps between clusters.

Source
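
A minimal usage sketch with scikit-learn's t-SNE (the dataset and perplexity value here are illustrative):

```python
# Embed the data into 2-D with t-SNE for visualization.
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)

# perplexity roughly controls the effective neighborhood size.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(embedding.shape)  # (1797, 2)
```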

Anomaly detection

E-mail spam filter

We select words that differentiate spam from non-spam more easily than others. Then we use a bag-of-words representation for each e-mail.

Then we compute the probability that an e-mail is spam or not spam with Naive Bayes (Bayes' Theorem with a conditional-independence assumption over the words).
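
In sketch form, with w₁, …, wₙ the words of the e-mail and the naive assumption that words are conditionally independent given the class:

P(\text{spam} \mid w_1, \dots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})

and similarly for the not-spam class; we pick the class with the higher score.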

Some tips:

  • To avoid underflow when multiplying many small numbers, take the logarithm of the equation, which turns the probability multiplications into additions.
  • People are usually more tolerant of false negatives than false positives when classifying e-mail as spam. So, instead of comparing the two scores directly, we can label an e-mail as spam only when its spam score exceeds the not-spam score by some margin.
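
A toy sketch that combines both tips (log-space scoring plus a margin before labeling an e-mail as spam); the word counts, priors, and margin below are made up for illustration:

```python
# Naive Bayes spam scoring in log space, with a biased decision threshold.
import math
from collections import Counter

def log_score(words, prior, word_counts, class_total, vocab_size):
    """log P(class) + sum_i log P(w_i | class), with add-one (Laplace) smoothing."""
    score = math.log(prior)
    for w in words:
        count = word_counts.get(w, 0)
        score += math.log((count + 1) / (class_total + vocab_size))
    return score

# Hypothetical word counts learned from a labeled corpus.
spam_counts = Counter({"free": 30, "winner": 20, "meeting": 1})
ham_counts = Counter({"meeting": 25, "report": 15, "free": 2})
vocab_size = len(set(spam_counts) | set(ham_counts))

email = ["free", "winner", "meeting"]
spam = log_score(email, 0.4, spam_counts, sum(spam_counts.values()), vocab_size)
ham = log_score(email, 0.6, ham_counts, sum(ham_counts.values()), vocab_size)

# Require the spam score to beat the not-spam score by a margin, so we favor
# false negatives (missed spam) over false positives (good mail marked as spam).
margin = math.log(3.0)
print("spam" if spam > ham + margin else "not spam")
```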

With Gaussian Distribution
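
A minimal sketch of anomaly detection with a Gaussian distribution, assuming we fit a single multivariate Gaussian to the data and flag points whose density falls below a threshold ε (the value below is arbitrary; in practice it would be tuned on a validation set):

```python
# Fit a Gaussian to the data and flag points whose density falls below a threshold.
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                 # "normal" data
X = np.vstack([X, [[6.0, 6.0]]])              # one obvious outlier

mu = X.mean(axis=0)
cov = np.cov(X, rowvar=False)
density = multivariate_normal(mean=mu, cov=cov).pdf(X)

epsilon = 1e-3                                # threshold to be chosen by validation
anomalies = np.where(density < epsilon)[0]
print(anomalies)                              # the injected outlier (index 500) appears here
```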

Clustering

Through clustering, outliers show up as points that either do not belong to any cluster or lie far away from their cluster's centroid or neighbors.
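
As a sketch of this, using k-means and the distance to the assigned centroid as a simple outlier score (the cluster count and cutoff are illustrative):

```python
# Score each point by its distance to the nearest k-means centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(200, 2)),
    [[10.0, -5.0]],                       # a point far from both clusters
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centroid = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag the points with the largest distances as outliers.
cutoff = np.percentile(dist_to_centroid, 99)
print(np.where(dist_to_centroid > cutoff)[0])
```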

Outlier-ness

But the typical neighbor distance (r) differs from cluster to cluster. To make a fair comparison, we normalize this distance by the average neighboring distance. Conceptually, we use the distance scale of the closest cluster to determine the outlier-ness.

But when two clusters are close together, we want the normalization to include only nodes from the relevant cluster. That can be done with a reverse lookup: first, we find a point's k-neighbors; then we check whether each of these neighbors also considers the point one of its own k-neighbors. If not, we exclude that neighbor from the normalization.
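
This scheme is close in spirit to the Local Outlier Factor. A usage sketch with scikit-learn's implementation, which performs a similar neighborhood-based normalization internally (though not with the exact reverse-lookup rule described above):

```python
# Local Outlier Factor: compares a point's local density with that of its neighbors.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),   # tight cluster
    rng.normal(loc=(5, 5), scale=1.5, size=(100, 2)),   # loose cluster
    [[2.5, 2.5]],                                        # point between the clusters
])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_      # higher score = more outlier-like

print(np.where(labels == -1)[0])
```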
