# Machine Learning — Recommender System

E-commerce companies and content providers use recommender systems to suggest related products or content to users. Even there is still plenty of room for improvement, recommender systems are one of the most successful stories in machine learning ML.

However, as Mike Tyson said, “Everyone has a plan until they get punched in the mouth.” The industrial recommender systems are built from lessons learned in years. They deal with diversified objectives with tough scalability constraints. They are not minimizing the Mean Square Error for their predictions based on historical datasets. I just want to point out the elephant in the room first. In this article, we will cover basic academic concepts. But please be aware that real production systems are tedious with many ensembles methods done in multiple steps. In fact, industrial deployments may use quite different algorithms developed over many years of experiments and experience. For this reason, we have updated the random walk section on real production systems.

**Content-based filtering**

Content-based filtering recommends items based on the similarity in features between an item and a user preference. For example, we can extract features from a movie description and match what a user prefers. For example, the movie “Spy” can be featured as a McCarthy movie and will match users that watch McCarthy movies regularly. Information about a movie can be extracted from the movie title, descriptions, and information provided by the studio. A user preference can be determined by explicit information given by a user, like the demographic information, or information that can be indirectly collected from what the viewers watch, interact and purchase. Many recommender systems need to deal with the cold start problem when there is not enough information collected for a new user. Deriving user meta-information from user interactions is important. For example, we can collect descriptions for movies that a user watched and use that as input to extract a user preference.

There are many feature extraction techniques. We can create a bag of words from the movie description or from the words collected for the user preference.

Then we use the TF-IDF method to recompute the values in this vector. It reduces the importance of a word that frequently occurs in movie descriptions.

We compare the similarity of the corresponding vector of a movie and a user preference for recommendations. For example, we can use the cosine similarity to match a user to different movies.

Other measuring methods, including the Pearson Coefficient, L1/L2 distance or Jaccard Similarity, can be used subject to how the vector is calculated. If you have problems to catch up with the ML terms so far, you can spend some time on this article first.

Like many ML methods, data preprocessing is important. We review the vocabulary used in preparing the bag of words and limit its size. Stop words, like “the”, should be removed. Review top words manually and remove them if it is irrelevant in your problem domain. We may also match phrases and treat them as a single word in the vocabulary. In addition, there are variants of words that should match to the same word. For example, all plurals can be treated as singulars.

Nevertheless, even TF-IDF is frequently mentioned in the recommender system, this method reduces the importance of common category words like “comedy” which is important in making recommendations. TF-IDF has an important role in a query-type searching. But in some contexts, it may produce too narrow focus recommendations that hurt the diversity of the recommendations. So some systems may eliminate some common and rare words that have little impact on recommendations. Then, they simply use the frequency or the occurrence of the words instead of TF-IDF. The exact choice is influenced by the application context. The words selected in the bags of words, the number of item suggestions and what functions the application is providing are all important factors. This is a good example that we cannot take textbook recommendations for granted. We are dealing with far more complex objectives and they often conflict with each other. In practice, many real systems use A/B testing to determine their design choices.

**Collaborative Filtering (User to user or Item to item)**

Content-based filtering has a limitation on the quality of the description provided by the content provider. Technically, there is a limitation of what features can be extracted from the limited amount of content information. Extracting “content” information from the user is hard too. In reality, it is easier to judge people by what they do than what they say. Behavior information, like what they view, how they rate items, is much easier to collect, less intrusive and more accurate.

In collaborative filtering, we focus on making predictions through user behaviors instead of extracting content features. The abundance of this information can make collaborative filtering appealing.

In collaborative filtering, we are provided with labels but not the content features. It collects behavior information on how you interact with items — what you rank, how you rank, what you view and/or what you purchase. One typical approach is to create a user rating matrix containing the users’ ratings on movies. We find similarities and make recommendations.

For example, we locate people that give similar movie ratings as you and make suggestions for movies that you have not watched but ranked highly from these people (user-to-user). Alternatively, instead of finding similarities in people, we can find similarities in movies (item-to-item) and make suggestions from what you already watch. Regarding the cold start problem, item-to-item suggestions may be easier when information for a new user is limited.

One possible implementation of collaborative filtering is to locate the *k*-nearest neighbors and use its average as our prediction. Or we can compute the similarities (*u*) between you and all other users and uses them to compute a weighted average for your possible ratings. Of course, some approximation or simplification is needed to scale the second approach.

Some people may be more generous in ratings. Therefore, we may need to normalize each person’s vote before finding their similarity. In the example below, we use the Pearson correlation to compute similarity.

**Collaborative Filtering (Matrix factorization)**

In a previous ML article, we use matrix decomposition to find the principal components of the data. This is an abstract linear algebra concept that is extremely important in ML.

SVD above is one of the matrix factorization methods. It decomposes data into independent principal components like the one below.

In a nutshell, from a rating matrix, we learn the latent factors (the principal components) that users use in rating movies. For example, these latent factors may be the genre of the movie, the year of the release, the actor or the actress in the movie. But we don’t define it explicitly. We do not know what the latent factors learned unless we examine the results manually. We let the computer to learn them from the data, just like other ML methods. Matrix decomposition is a big topic and we will not elaborate on it further. A high-level discussion can be found here or a more detailed one in here.

But the number of latent factors can be huge but many of them may not be important. In PCA, we simply truncate those that are not important (those with very small singular value *σᵢ* comparing to others).

So even the rating matrix is large, we can reduce the information to latent factors in representing a user or a movie in a much lower dimension. This is why the concept is also termed as low ranking matrix factorization. We are reducing the rank of the matrix (dimension) through factorization.

With the concept of SVD, any user rating matrix can be decomposed into matrix *P* containing user latent factor and matrix *Q* containing item latent factor. The movie rating from a user can be reconstructed by multiplying the corresponding latent factor of the user and the movie.

But how can we find the matrices if there are missing entries in the user rating matrix? Users rate or interact with a handful of movies only. To solve that, we apply SGD to learn the decomposed matrix *Z* and *W*.

But we include errors for valid entries in the rating matrix ** R** only. We will not introduce an error to the cost function for entries with missing ratings.

In addition, there will be rating bias in individual user and some cult movie can be rated much higher than average. We will add those biases when making a prediction and add regularization terms for them in the cost function.

# Random walk

Many Markov processes can be solved with random walks. Let’s say we drop off 100 shoppers randomly around the downtown area in San Franciso and we provide a transition matrix to show the probability of where the shoppers may head next from the current position. Eventually, we can spot which shops people usually visited. Recommender systems may use similar concepts except we drop all shoppers in one place (for example, the product that you are currently viewing) and check where the shoppers are after a certain time.

We can represent each movie as a node in a graph. Based on many factors including what products users may select next, items that may be viewed or purchased together, what products a user rated highly together, or other user behaviors, we can formulate a transition probability from one product to another. These probabilities can be represented as a transition matrix below. We want to know the probability distribution of where we may land on as time approaches infinity. These probabilities serve as the ranking in making product recommendations.

We can solve the problem by finding the eigenvectors of the Transition matrix but this is usually hard to scale for real e-commerce systems.

Fortunately, we can usually find this solution within finite steps. If we keep multiplying the transition matrix above, it will converge and in many cases, it converges very fast (we don’t need to perform infinite steps). The vector on the R.H.S. demonstrates the probability of each state we may be in.

The random walk is one of the most popular recommender algorithms in industrial deployments. Nevertheless, it is a huge topic alone but we will look into two major deployments involving the Markov processes.

# Google PageRank

Google PageRank utilizes the inbound and outbound links to a page in ranking their search. (The detail of the PageRank is discussed in the context of eigenvector here.)

By finding the stable state of such a random walk, the corresponding probability of each state can be used for the final ranking. The problem is defined as finding the eigenvectors in the matrix factorization.

In practice, we continue multiple matrix R and wish that it converges fast. The PageRank paper has demonstrated that with 322 million page links, the solution converges to a tolerable limit in 52 iterations. So the solution scales unexpectedly well (details).

Here is an example in comparing Coaches Poll on NCAA basketball ranking versus the one computing with the random walk concept using the Markov process. It is not bad that the computer findings are close to the poll by the coaches.

The transition probability is updated by the following equations. If team *A* wins over team *B* the transition probability increase from *B* to *A*. This will further increase if the game is a blowout.

# Pinterest Pixie

Manipulating such a large matrix is extremely hard. Many researchers focus years-after-years in streamlining matrix operations. Google even has hardware like the Tensor Processing Unit in specializing matrix operations for Deep Learning.

Alternatively, we can approximate the solution. In many ML problems, sampling is one popular approach in probability density estimation. For a huge system requiring an on-demand real-time solution, this may be more feasible. In Pinterest, users can associate a pin to their own board. So Pinterest can build a gigantic graph with 3 billion nodes and 17 billion edges in linking the pins and boards together for the whole site.

At each node, we can randomly select where to go next. Eventually, this process collects many pins that users may visit. The longer the walk, the more choices will be explored but you may risk to include less relevant items if you cannot run the random walk for too long. We can repeat the process from the starting pin many times to form an approximation on the visiting frequency on the pins for a Markov process. The following is the algorithm:

Another possibility is using a biased weight in selecting nodes. For example, boards with fewer pins may be boosted to indicate that the pins in them may be more focused and more important than boards with a lot of pins. The weight can also be biased based on users’ features including geographical and gender information. Like many engineering solutions, the adopted solution will involve many trial-and-error, heavy experiments, and tuning. For example, new pins can receive a higher boost in the random walk.

Here is the general algorithm used in Pixie on Pinterest.

Here, it applies early stopping as well as using multiple pins *Q* to seed the random walk. Pixie allows multiple query pins with different weights. For example, the weight is higher for more recently interacted pins. The additional pins capture the context of the user’s previous behavior. For example, it may capture a sequence of interacted pins during a search. As shown above, V[*p*] is not summed linearly. Instead, the count *V* will be boosted if multiple initial pins (query pins) lands on the same pin (using a quadratic function).

As mentioned, some boards are less focused. It is wise to purge some that are overboard in topics. As quoted in the research paper:

We approach the problem of graph pruning as follows. First, we quantify the content diversity of each board by computing the entropy of its topic distribution. We run LDA topic models on each pin description to obtain probabilistic topic vectors. We then use topic vectors of the latest pins added to a board as input signals to compute board entropy. Boards with large entropy are removed from the graph along with all their edges.

In short, it applies LDA to analyze the topic distribution of boards and purge those that cover too many topics. Also, there are popular pins that everyone repined. So it may be wise to systematically discarding edges of high degree pins. In Pixie’s research paper,

However, rather than discarding the edges randomly we discard edges where a pin is miscategorized and does not belong topically into a board.

By pruning the graph, the graph for the whole site may fit into the memory of a single server. Without cross-server communication, it can meet the on-demand requirement on latency.

Again, engineering solutions are usually different from academic research. Here is one of the popular architecture of the recommender system.

First, instead of a single algorithm, multiple algorithms and sources are used to generate candidates as relevant items. For example, Pixie can build a separate graph based on who you followed to generate candidates. So there may be a pin-to-board graph and user-to-pin graph. Then, Pixie combines the candidates and rank them. At last, to avoid similar items clustering together, a blender is applied to create some slight randomness of the display.

Let’s focus on the ranking model. As deep learning is more mature, the system can build a deep network to predict the likelihood of interacting with a pin. First, Pixie extracts features for the candidates and uses a fully connected network in making predictions.

To generalize the solution, Pixie is trained to predict multiple tasks including repin, click, long-click and close-up. This avoids the network to be overfitted for a specific task and make the learned model more generalize. The following is the weighted lost function combining all four tasks.

# Association rules

Many years ago, a finding suggested that people buying diapers are likely to buy beer also. In this section, we will go through some of the concepts that may draw such conclusions.

**Apriori Algorithm**

Transaction data shows a lot of user behaviors and patterns. By making associations between items in millions of transactions, we can design deals, product placements, and recommendations better based on the data.

Let’s define a few terms first. **Support ***S*** **is the probability of events (say what is the probability of purchasing S₁, S₂, …S*k *∈ *S).*

**Confidence **is how often *T* happens when *S* happens.

In **association**, we are looking for high support and high confidence.

For example, an e-commerce company may display associated items, as related products, for the current items in your shopping cart through association analysis.

First, we start with the transaction records of customers. Starting from one item, we compute the corresponding support. (For simplicity, we use the count instead of the probability in our examples.) But to avoid exponential growth of combinations, entries with count smaller or equal than a threshold will be eliminated.

Next, we create 2-item combinations with the remained items on the left above. This stops our tree to grow exponentially.

Then, we continue to build 3, 4 and 5 item-combination unless there are no more items left to build. After this exercise, we create the frequent itemsets (*A, P, K, M, AP, AM, PM, KM & APM*).

Our next task is to create association rules. For the combination *APM*, the possibilities are

But as the number of items grows, this combination grows exponentially also. So we need to prioritize which itemset to start our investigation first. Usually, we start with the itemset with the highest confidence first.

Nevertheless, confidence alone can be misleading. Confidence *P(Y|X)* can be high simply because *Y* is popular. We want to know the additional confidence in seeing *Y* given *X*. **Lift** recalibrates the confidence by dividing it with *P*(Y).

If the lift is greater than one, users will be more likely to purchase itemset 𝑌 given they buy itemset 𝑋, or the opposite if it is smaller than one. Conviction is the expected frequency that *X* occurred while *Y* did not happen. A value of 1.3 means 30% of the prediction is wrong if *X* and *Y* happen together by chance only.

Let’s have an example of applying the Apriori Algorithm in forming association rules. The example below tries to find any associations for the responses in a 14 question survey. For questions with categorical answers, each possible answer is treated as a separate item. However, for an ordinal answer, this example treats it as either smaller than the median, or greater or equal with the median. So the possible events are like (male, single, single, married, divorced. income < $40,000, income ≥ $40,000, …).