Developing a Modern Search Stack: Personalization, Making Search Results Feel Less Generic
This article is part of a series from Fullscript about how we built our modern search stack. You can find the other articles here.
In part 5 of this blog series, we described how we added a vector model to our search stack, which gave us the power of hybrid search - balancing lexical and semantic matching. This helped to make our search results much more relevant to a user’s query, but we still had one key issue. Two users with very different purchase histories could perform the same search and their results would be in the exact same order. We needed to add personalization.

How Our Personalization Service Works
Although much of our codebase is written in Ruby, we chose to write our personalization service in Python because of the libraries available in its ecosystem. We used the implicit library, as it is popular and well supported for this type of task. Within it, we chose the Bayesian Personalized Ranking (BPR) method because we care more about the relative ordering of products than the absolute "personalization score" you would get from an Alternating Least Squares (ALS) model. Since we assume that the products in our search results are all relevant (there are still some false positives, but it's a reasonable assumption), it's better for us to mirror the logic of our LTR model, which is also trained to optimize the relative ordering of products in a result set. With both our LTR and personalization services using the same optimization logic, we can make an apples-to-apples comparison when we combine their results.
Combining LTR and Personalization Results
To combine the results from our LTR and personalization services, we first apply a min-max transformation to their scores. This puts the two sets of scores on the same scale, which makes it easier to combine them. After the transformation, we use a weighted average to balance the importance of the LTR and personalization scores. We found that giving the LTR scores greater weight resulted in better rankings.
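A minimal sketch of this combination step; the 0.7/0.3 weighting here is illustrative, not our actual tuned value:

```python
import numpy as np

def min_max(scores: np.ndarray) -> np.ndarray:
    """Rescale one result set's scores to [0, 1]."""
    lo, hi = scores.min(), scores.max()
    if hi == lo:  # degenerate result set: every product scored equally
        return np.zeros_like(scores)
    return (scores - lo) / (hi - lo)

def combine(ltr_scores, pers_scores, ltr_weight=0.7):
    """Weighted average of min-max-normalized LTR and personalization scores."""
    ltr = min_max(np.asarray(ltr_scores, dtype=float))
    pers = min_max(np.asarray(pers_scores, dtype=float))
    return ltr_weight * ltr + (1 - ltr_weight) * pers

# Example: three products scored by both services on their native scales.
final = combine([2.1, 0.4, 1.3], [0.05, 0.9, 0.5])
order = np.argsort(-final)  # rerank: highest combined score first
```

Because both inputs are normalized per result set, the blend weight stays meaningful even when either service's raw score distribution shifts.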
It’s worth noting that the scores output by our personalization and LTR services are only meaningful within a specific result set. Since we use BPR for personalization and an XGBoost reranker optimized for NDCG for our LTR service, the scores from the two services cannot be compared across user queries: each score is calculated relative to the other products in its result set. Because we are working with these “local” scores, we have to optimize the personalization service and calculate the optimal weighted average using this local information as well.
Examples of the scores’ distributions can be seen below. In the first example, some products have very high personalization scores, which lead to high weighted average scores when combined with the LTR scores. In the second example, the maximum score from the personalization service is much lower because no product is especially well suited to the user relative to the other products in the result set (remember that scores are computed relative to the other products in the result set). This results in a more Gaussian distribution for the weighted average scores, with a lower maximum score.


Training and Evaluation
The training data for a personalization model consists of users viewing a product’s information page, adding a product to their cart, and purchasing a product. The evaluation data is limited to just the add-to-carts and purchases, since it’s more important for our business to optimize for those. This makes the evaluation data a bit sparser, but it keeps it focused on our key metrics.
After some testing, we found that using six months of data works best. This keeps the data fresh, and given the purchasing cadence of users, it gives us multiple data points for many of them. When optimizing the model, we use the most recent month for evaluation and the prior five months for training. This ensures that there is no data leakage between the training and testing datasets. However, before we deploy a model, we train it on all six months of data so that it has the most recent data available to it. A model is retrained and deployed every day to keep its data fresh.
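The temporal split described above can be sketched as follows; the tuple layout and the 30-day/180-day window sizes are simplifying assumptions, not our exact cutoffs:

```python
from datetime import date, timedelta

def split_interactions(interactions, today):
    """Split (user_id, product_id, event_date) tuples into train and eval sets.

    The most recent ~month is held out for evaluation; the prior five
    months are used for training; anything older than six months is dropped.
    """
    eval_start = today - timedelta(days=30)
    window_start = today - timedelta(days=180)
    train = [i for i in interactions if window_start <= i[2] < eval_start]
    evaluation = [i for i in interactions if eval_start <= i[2] <= today]
    return train, evaluation

today = date(2024, 6, 30)
interactions = [
    ("u1", "p1", date(2024, 2, 10)),  # falls in the training window
    ("u1", "p2", date(2024, 6, 15)),  # falls in the evaluation window
    ("u2", "p3", date(2023, 11, 1)),  # older than six months: dropped
]
train, evaluation = split_interactions(interactions, today)
```

Splitting on time rather than randomly is what prevents future interactions from leaking into the training set.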
When we optimize a model, we leverage the Python library Optuna. This library efficiently samples values for the model’s hyperparameters and different weighted average values to help us achieve the best results. By optimizing a model’s hyperparameters and the weighted average with the LTR service at the same time, we can create the best personalization model that works with our LTR service, rather than the best overall personalization model. Our primary metric for this performance is Mean Reciprocal Rank (MRR), but we also inspect Normalized Discounted Cumulative Gain (NDCG), Recall@K, Precision@K, and a few others.
One last feature of our evaluation is the harm penalty. To prevent the personalization service from altering the results too much, we penalize changes that hurt our search results more heavily than we reward changes that improve them. This makes the penalty conservative, but we think it’s a better user experience to have a relevant personalized product ranked too low than an irrelevant one ranked too high.
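One way such an asymmetric penalty could be expressed (a hypothetical sketch, not our exact formula) is to weight rank demotions more heavily than promotions when scoring what personalization did to the relevant products:

```python
def harm_adjusted_gain(baseline_ranks, personalized_ranks, harm_weight=2.0):
    """Sum of rank changes for relevant products, penalizing demotions.

    Ranks are 1-based positions of each relevant product before and after
    personalization. Positive deltas (product moved up) count once;
    negative deltas (product moved down) are multiplied by harm_weight.
    The 2.0 default is illustrative.
    """
    total = 0.0
    for before, after in zip(baseline_ranks, personalized_ranks):
        delta = before - after  # positive => product moved up the results
        total += delta if delta >= 0 else harm_weight * delta
    return total

# One relevant product promoted two spots, another demoted one spot:
# the single demotion cancels out a promotion twice its size.
score = harm_adjusted_gain([5, 3], [3, 4])
```

Under this kind of metric, a candidate model only wins if its improvements clearly outweigh the harm it causes.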
Challenges
When we first developed the personalization service, we combined the raw LTR and personalization scores (rather than using the min-max transformation). Although this worked well initially, when we made changes to both our LTR and personalization services, the new scores had different distributions than the original scores that we optimized the weighted average for. This meant that the final ranking produced by the new scores struggled to outperform the original scores. We needed to improve the coupling between services. By applying the min-max transformation, the scores’ ranges remained constant while we iterated on both services.
Results and Next Steps
Once again, our search metrics improved notably. There were some great examples where a product that a user often purchased was bumped to the first spot. We still have ideas for improving personalization. We could try deep learning-based models to create richer embeddings. We could also improve the coupling between personalization and LTR by developing another ranking service that uses additional data and machine learning to produce even more relevant rankings.