Machine Learning — Visualization, Dimension Reduction & Anomaly Detection
Visualization & dimension reduction
Multi-dimensional scaling (MDS)
MDS visualizes data in a low dimension. PCA extracts independent features that are orthogonal to each other. While this is good for model training, it often loses the structure of the data and makes visualization hard. MDS instead focuses on creating a mapping that preserves the relative distances between data points: if two points are close in the feature space, they should be close in the latent factor space.
We create a cost function that maintains the relative distances, e.g. C = Σᵢ<ⱼ (‖xᵢ − xⱼ‖ − ‖zᵢ − zⱼ‖)², where x are the original points and z their low-dimensional embeddings.
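As a concrete sketch (not from the original article), this metric-MDS stress and a plain gradient-descent minimization of it fit in a few lines of NumPy; the data here is illustrative:

```python
import numpy as np

def stress(D, Z):
    """Metric-MDS cost: squared mismatch between the target pairwise
    distances D and the pairwise distances of the embedding Z."""
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    return ((D - dz) ** 2).sum() / 2          # each pair counted once

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 5))                  # 10 points in 5-D
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

Z = rng.normal(size=(10, 2))                  # random 2-D start
before = stress(D, Z)
for _ in range(1000):                         # gradient descent on stress
    diff = Z[:, None, :] - Z[None, :, :]
    dz = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dz, 1.0)                 # avoid divide-by-zero
    Z -= 0.01 * (((dz - D) / dz)[:, :, None] * diff).sum(axis=1)
print(stress(D, Z) < before)                  # stress has decreased
```

Production implementations (e.g. scikit-learn's `MDS`) use the more robust SMACOF algorithm instead of raw gradient descent, but the cost being minimized is the same.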
In Sammon mapping, the loss function is re-weighted by the distance between the input features — each pairwise error is divided by the original distance — so we do not miss out on the fine-detail structure.
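A small sketch of the standard Sammon stress (the formula is standard; the demo data is illustrative) makes this re-weighting explicit:

```python
import numpy as np

def sammon_stress(D, Z, eps=1e-12):
    """Sammon stress: MDS-style error, but each pairwise term is divided
    by the INPUT distance d_ij, so mismatches on small (fine-detail)
    distances weigh relatively more than mismatches on large ones."""
    dz = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
    i, j = np.triu_indices(len(D), k=1)       # each pair once
    d, dhat = D[i, j], dz[i, j]
    return ((d - dhat) ** 2 / (d + eps)).sum() / (d.sum() + eps)

# sanity check: an embedding that preserves all distances has zero stress
rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 2))
D = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=-1)
print(sammon_stress(D, Z))                    # ~0.0
```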
Data points may lie on a manifold, like the Swiss roll below. We can discover the structure and cluster data points better by following the manifold instead.
First, we need to redefine the distance between two points. The geodesic distance is the distance along the manifold. So even though the Euclidean distance between points A and R is smaller than that between A and C below, A and C are far closer along the manifold — AC has a geodesic distance of 2 while AR has a geodesic distance of 17.
Here is the high-level IsoMap algorithm:
- Find the close neighbors around each point (points within a fixed radius or K nearest neighbors).
- Construct a graph connecting neighboring nodes, with edge lengths equal to the Euclidean distance.
- Compute the shortest path between each pair of nodes using a shortest-path algorithm — this is the geodesic distance.
- Run MDS using the computed geodesic distance.
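The steps above can be sketched in NumPy/SciPy, using classical MDS (double-centering plus eigendecomposition) for the final step; the half-circle demo data is an illustrative stand-in for the Swiss roll:

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=5, n_components=2):
    """IsoMap sketch: kNN graph -> geodesic (shortest-path) distances
    -> classical MDS on those distances."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    graph = np.full((n, n), np.inf)               # inf = no edge
    for i in range(n):
        nn = np.argsort(d[i])[1:n_neighbors + 1]  # skip the point itself
        graph[i, nn] = d[i, nn]                   # edge = Euclidean length
    graph = np.minimum(graph, graph.T)            # symmetrize the graph
    G = shortest_path(graph, method="D", directed=False)  # geodesics
    # classical MDS: double-center the squared distances, eigendecompose
    J = np.eye(n) - 1.0 / n
    B = -0.5 * J @ (G ** 2) @ J
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:n_components]
    return vecs[:, top] * np.sqrt(np.maximum(vals[top], 0))

# a half-circle arc: its Euclidean diameter is 2, but its arc length is
# ~pi, and IsoMap "unrolls" it along the first embedding axis
t = np.linspace(0, np.pi, 30)
X = np.c_[np.cos(t), np.sin(t)]
Z = isomap(X)
print(Z.shape)                                    # (30, 2)
```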
t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE follows the same template as MDS but with special choices for the distance and cost functions: pairwise distances are converted into neighbor probabilities using a Gaussian kernel in the input space and a heavier-tailed Student's t kernel in the embedding, and the mismatch between the two is measured with KL divergence.
t-SNE tends to produce clearer gaps between clusters.
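A minimal sketch with scikit-learn's `TSNE` (assuming scikit-learn is available; the two-cluster data is illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# two well-separated clusters in 10-D
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 10))
               for c in (0.0, 8.0)])

# perplexity sets the effective number of neighbors each point "attends" to
Z = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)
print(Z.shape)                                # (60, 2)
```

In the 2-D embedding the two clusters come out as well-separated blobs — the distance between the cluster centroids far exceeds the spread within each cluster.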
E-mail spam filter
We select words that differentiate spam from non-spam more readily than others. Then we use a bag of words to represent each email.
Then, we compute the probability that an email is spam or not spam using Bayes' Theorem with the naive independence assumption (Naive Bayes).
To avoid underflow when multiplying many small numbers, we take the logarithm of the equation, which turns the probability multiplications into additions. People may be more tolerant of false negatives than false positives when classifying e-mail as spam. So, instead of comparing the two probabilities directly, we can require the spam score to exceed the not-spam score by a margin before flagging a message.
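A self-contained sketch of these ideas — log-probabilities, Laplace smoothing, and a decision margin; the corpus and all words are hypothetical, not from the article:

```python
import math
from collections import Counter

# tiny illustrative corpus of tokenized e-mails
spam_docs = [["win", "money", "now"], ["free", "money"]]
ham_docs = [["meeting", "today"], ["project", "update", "today"]]

def train(docs):
    counts = Counter(w for doc in docs for w in doc)
    return counts, sum(counts.values())

spam_counts, spam_total = train(spam_docs)
ham_counts, ham_total = train(ham_docs)
vocab = set(spam_counts) | set(ham_counts)

def log_score(words, counts, total, prior):
    # the log turns the product of word probabilities into a sum,
    # avoiding underflow; the +1 is Laplace smoothing for unseen words
    score = math.log(prior)
    for w in words:
        score += math.log((counts[w] + 1) / (total + len(vocab)))
    return score

def classify(words, margin=0.0):
    # require spam to beat ham by a margin before flagging, since a
    # false positive (good mail marked spam) is costlier than a miss
    spam = log_score(words, spam_counts, spam_total, 0.5)
    ham = log_score(words, ham_counts, ham_total, 0.5)
    return "spam" if spam - ham > margin else "ham"

print(classify(["free", "money", "now"]))     # -> "spam"
```

Raising `margin` makes the filter more conservative: borderline messages stay in the inbox.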
With Gaussian Distribution
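A minimal sketch of this approach, assuming an independent-feature Gaussian model (the data and threshold choice here are illustrative): fit a mean and variance per feature, then flag points whose log-density under the model is low.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))       # "normal" training data

# independent-feature Gaussian model: one mean/variance per feature
mu, var = X.mean(axis=0), X.var(axis=0)

def log_density(P, mu, var):
    # sum of per-feature Gaussian log-densities
    return (-0.5 * np.log(2 * np.pi * var)
            - (P - mu) ** 2 / (2 * var)).sum(axis=1)

points = np.array([[0.0, 0.0], [6.0, 6.0]])   # typical point vs. outlier
scores = log_density(points, mu, var)
print(scores[0] > scores[1])                  # True: outlier scores lower
```

In practice the cutoff ε below which a point is declared anomalous is tuned on a validation set containing known anomalies.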
Through clustering, outliers are points that either do not belong to any cluster or lie far away from their centroid or neighbors.
But the typical neighbor distance (r) differs from cluster to cluster. To make a fair comparison, we normalize this distance by the average neighboring distance. Conceptually, we judge a point's outlier-ness using the distance scale of its closest cluster.
But when two clusters are close together, we want the normalization to include only nodes belonging to the relevant cluster. That can be done with a reverse lookup. First, we find the point's k nearest neighbors. Then we check whether each of those nodes also considers the point one of its own k nearest neighbors. If not, we exclude it from the normalization.
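This normalize-by-local-scale idea is essentially what the Local Outlier Factor (LOF) implements. A sketch with scikit-learn's `LocalOutlierFactor` (which uses reachability distances rather than the exact reverse-neighbor filter described above; the data is illustrative):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
tight = rng.normal(0.0, 0.1, size=(30, 2))    # dense cluster
loose = rng.normal(5.0, 1.0, size=(30, 2))    # sparse cluster
outlier = np.array([[10.0, 10.0]])            # far from both
X = np.vstack([tight, loose, outlier])

# LOF compares each point's local density with that of its neighbors,
# so the dense and sparse clusters are judged on their own scales
lof = LocalOutlierFactor(n_neighbors=10)
labels = lof.fit_predict(X)                   # -1 = outlier, 1 = inlier
print(labels[-1])                             # -1
```

Because each point is scored relative to its own neighborhood, members of the sparse cluster are not penalized merely for having larger neighbor distances than members of the dense one.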