The concept of “similarity” is something that content-based recommendation systems and search engines have in common. Running a full-text search is essentially asking the search engine for the documents that are most similar to the query, just as a content-based recommendation system gives you the products that are most similar to the current one. Fullscript’s Eureka pod is stacked with experts in search, specifically Elasticsearch, so automating our similar products recommendation system with Elasticsearch was almost a no-brainer for us.

Since we were building a recommendation system for wellness products such as supplements, and we are just software engineers, this project started out with us consulting our Internal Medical Advisory Team (IMAT). Before this project, our similar product recommendations were powered by the knowledgeable people on IMAT, who manually and painstakingly curated similar product lists for as many products in our catalog as they could. IMAT told us which data points they looked at to decide which products belong in a particular product’s recommendation list, so that we could try to replicate the logic in one or more Elasticsearch queries. It turned out that they mostly looked at the ingredient makeup of the product and the dosages of its ingredients, as well as how products are marketed in their names and descriptions.

A big challenge for us in this project was going to be ingredient dosages, which are numerical values with units. Search engines are good at evaluating and scoring the similarity of text documents, and we already knew about the more like this (MLT) query, so ingredient makeup and marketing text were going to be easier to compare than ingredient dosages. Thankfully, we discovered the decay function score query for ingredient dosage comparisons, which decays a document’s score based on the distance of a numerical field from an origin.

All of this information from IMAT suggested that we would need multiple Elasticsearch queries whose results would somehow be combined into our final result set. In the end, we were able to combine all of these queries into one large query containing many subqueries. That large query was appropriately (and creatively) dubbed “The Similar Products Query”.

The Similar Products Query

When a user visits a product page on Fullscript, we run a single query to get the similar product list for a particular product, which from now on we will call the “source product”. The Similar Products Query is made up of several MLT and dose/decay subqueries. Each of these subqueries is wrapped in a script score query, which lets us normalize the subquery’s score and gives us full control over the subquery’s weight in a document’s total score. Each subquery is assigned a constant BOOST_FACTOR value, which determines that weight. All of the subqueries are wrapped in a boolean query should clause, which is a kind of logical OR operator in Elasticsearch.

Lastly, the Similar Products Query has a must_not clause in the boolean wrapper, which excludes the source product itself from the result set. The main part of the Similar Products Query looks something like this:
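As a rough illustration of that shape, here is a minimal sketch in Python that builds the request body as a plain dictionary. It is not Fullscript's actual query: the helper names, the `normalizer` parameter, and the use of an `ids` query for the exclusion are assumptions made for this example.

```python
def weighted_subquery(subquery, normalizer, boost_factor):
    """Wrap a subquery in script_score so that its score becomes
    (raw score / normalizer) * boost_factor."""
    return {
        "script_score": {
            "query": subquery,
            "script": {
                "source": "(_score / params.normalizer) * params.boost_factor",
                "params": {
                    "normalizer": float(normalizer),
                    "boost_factor": float(boost_factor),
                },
            },
        }
    }


def build_similar_products_query(source_product_id, subqueries):
    """Combine weighted subqueries in a should clause and exclude the source product."""
    return {
        "query": {
            "bool": {
                # should sums the scores of every subquery that matches
                "should": list(subqueries),
                # hypothetical exclusion by document ID
                "must_not": [{"ids": {"values": [str(source_product_id)]}}],
            }
        }
    }
```

Because every subquery is normalized before it is weighted, the BOOST_FACTOR values can be compared directly with one another when tuning.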

More Like This (MLT) Subqueries

Elasticsearch’s more like this (MLT) query uses its normal relevance scoring to find the documents most similar to an existing document in the cluster for a field or set of fields. We used this to create the ingredient list subqueries and one subquery that captures how the product is marketed. A single MLT subquery looks something like this:
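A minimal sketch of such a subquery, again as a Python dictionary. The index name, field names, and the lowered `min_term_freq` and `min_doc_freq` values are assumptions for illustration, not the settings used in production.

```python
def build_mlt_subquery(index_name, source_product_id, fields):
    """Build a more_like_this query that compares the given fields
    against an already indexed source document."""
    return {
        "more_like_this": {
            "fields": list(fields),
            # "like" can reference an existing document by index and ID
            "like": [{"_index": index_name, "_id": str(source_product_id)}],
            # Assumption: an ingredient ID appears once per product, so the
            # default thresholds (2 and 5) would discard most terms.
            "min_term_freq": 1,
            "min_doc_freq": 1,
        }
    }


# Hypothetical usage: one subquery over an ingredient ID field
ingredient_subquery = build_mlt_subquery("products", 42, ["ingredient_ids"])
```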

Out of the box, the MLT query does not give a score out of 1, where 1 is a perfect match, and we needed that in order to have full control over the weight of each subquery. To solve this problem, we implemented a method called perfect_score in an abstract class. It runs the same MLT subquery that is about to be run, but as a stand-alone query: instead of excluding the source product from the results, it includes it and limits the result set to 1 product (which will be the source product). The score of that result is then considered the “perfect score”, and all other scores are divided by it to get a ratio score out of 1.
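The idea can be sketched as follows. This is an illustration under assumptions, not the perfect_score method itself: the function names are hypothetical, and using the MLT query's `include` option to keep the source document in the results is one possible way to do what is described above.

```python
import copy


def perfect_score_request(mlt_subquery):
    """Build a stand-alone search body that returns only the top hit,
    with the source document allowed in the results."""
    query = copy.deepcopy(mlt_subquery)  # leave the caller's subquery untouched
    query["more_like_this"]["include"] = True
    return {"size": 1, "query": query}


def extract_perfect_score(response):
    """Read the score of the top hit from a search response dictionary."""
    hits = response["hits"]["hits"]
    return hits[0]["_score"] if hits else None


def ratio_score(raw_score, perfect_score):
    """Express a raw MLT score as a fraction of the perfect score."""
    return raw_score / perfect_score
```

For example, if the stand-alone query returns the source product with a score of 12.0, a candidate that scores 9.0 on the same subquery gets a ratio score of 0.75.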

The fields that the MLT subqueries search against are all indexed as text fields. The ingredient list fields get indexed as a space-delimited string of ingredient IDs. It is done this way so that Elasticsearch’s relevance scoring works properly. We index ingredient IDs rather than ingredient names, because an ID is a single unique token to represent an ingredient object, and ingredient names can share words such as “extract” which would make scoring less predictable.
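A small sketch of why IDs make cleaner tokens than names. The ingredient IDs and names are made up, and splitting on whitespace only approximates what a text analyzer does.

```python
def ingredient_ids_text(ingredient_ids):
    """Join ingredient IDs into the space-delimited string that gets indexed."""
    return " ".join(str(ingredient_id) for ingredient_id in ingredient_ids)


def shared_tokens(text_a, text_b):
    """Return the lowercase whitespace-separated tokens two strings have in common."""
    return set(text_a.lower().split()) & set(text_b.lower().split())


# Two different (made-up) ingredients whose names overlap on a generic word
name_overlap = shared_tokens("Green Tea Extract", "Grape Seed Extract")

# The same two ingredients represented by hypothetical IDs share nothing
id_overlap = shared_tokens(ingredient_ids_text([101]), ingredient_ids_text([205]))
```

Here `name_overlap` contains the token `extract`, while `id_overlap` is empty, so only genuinely shared ingredients contribute to the MLT score.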

Dose/Decay Subqueries

In order to score ingredient dose similarity, we used the gauss decay function query. Elasticsearch’s decay functions score documents out of 1 with a function that decays the score depending on the distance of a numeric field value of the document from a given origin, which in our case is the dose of an ingredient. Decay functions have 4 inputs:

origin: The point of origin used for calculating distance. In our case this is the dose of an ingredient of the source product.
offset: The decay is only applied to documents whose distance from the origin is greater than the offset. This means that if an ingredient’s dose is within the offset of the origin dose, it is scored as if it were the same dose, with a perfect score of 1. We have this defined to be 10% of the origin dose.
scale: Defines the distance from origin + offset at which the computed score will equal the decay parameter. We have this defined to be 25% of the origin dose.
decay: Defines how documents are scored at the distance given by scale. We have not defined this anywhere, so Elasticsearch uses its default value of 0.5.
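To make the four inputs concrete, here is a small Python version of the gauss decay formula as given in the Elasticsearch documentation. The function name and the example dose of 100 are illustrative; the 10% offset and 25% scale follow the values described above.

```python
import math


def gauss_decay(value, origin, offset, scale, decay=0.5):
    """Score a numeric value between 0 and 1 with a Gaussian decay.

    Values within `offset` of `origin` score 1.0; at a distance of
    `offset + scale` the score equals `decay`.
    """
    # Variance chosen so that the score is exactly `decay` at distance `scale`
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    distance = max(0.0, abs(value - origin) - offset)
    return math.exp(-(distance ** 2) / (2.0 * sigma_squared))


# Source dose of 100 units: offset is 10% (10) and scale is 25% (25)
within_offset = gauss_decay(105, origin=100, offset=10, scale=25)
at_scale = gauss_decay(135, origin=100, offset=10, scale=25)
```

A candidate dose of 105 falls inside the offset and scores 1.0, while a dose of 135 sits exactly at offset + scale and scores approximately 0.5, the default decay.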

Image from: Similar Products: Using Elasticsearch to Build a Content-Based Recommendation System

The decay function works for only one ingredient dose. Since products are made up of many ingredients, our dose/decay subqueries are actually made up of as many nested decay functions as there are ingredients with dosages in the source product, all wrapped in a should. A dose/decay subquery after it is built will look like this:
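A sketch of how such a subquery could be assembled. The nested path `doses`, the field names `doses.ingredient_id` and `doses.amount`, and the `score_mode` and `boost_mode` choices are assumptions for illustration, not the production query.

```python
def build_dose_subquery(doses, offset_ratio=0.10, scale_ratio=0.25):
    """Build one nested gauss decay clause per (ingredient_id, amount) pair
    of the source product, all wrapped in a should clause."""
    clauses = []
    for ingredient_id, amount in doses:
        clauses.append({
            "nested": {
                "path": "doses",
                "score_mode": "max",
                "query": {
                    "function_score": {
                        # Only compare doses of the same ingredient
                        "query": {"term": {"doses.ingredient_id": str(ingredient_id)}},
                        "gauss": {
                            "doses.amount": {
                                "origin": amount,
                                "offset": amount * offset_ratio,
                                "scale": amount * scale_ratio,
                            }
                        },
                        # Use the decay value itself as the score (0 to 1)
                        "boost_mode": "replace",
                    }
                },
            }
        })
    return {"bool": {"should": clauses}}
```

With `boost_mode` set to `replace`, each clause contributes only its decay value, which keeps the per-ingredient scores in the 0 to 1 range.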

Unlike MLT queries, decay functions do return a score between 0 and 1, with 1 being a perfect match. But since we are running more than one decay function query, and should sums the scores of the queries within it, we need to divide the total score by the number of decay queries (which is also the number of ingredient group doses for the source product) before multiplying by the BOOST_FACTOR.
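The arithmetic can be illustrated in a few lines of Python. The function name and the example numbers are made up for this sketch.

```python
def weighted_dose_score(decay_scores, source_dose_count, boost_factor):
    """Average the summed decay scores over the number of source doses,
    then apply the subquery's weight."""
    return sum(decay_scores) / source_dose_count * boost_factor


# Source product has 3 dosed ingredients. A candidate matches one dose
# perfectly (1.0), one partially (0.5), and lacks the third ingredient.
example_score = weighted_dose_score([1.0, 0.5], source_dose_count=3, boost_factor=2.0)
```

Here the summed score of 1.5 is divided by 3 and multiplied by 2.0, giving 1.0 out of a possible 2.0. Dividing by the number of source doses rather than the number of matched doses means that missing ingredients lower the score.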

For the dose subqueries to work, the fields they search against need to be indexed as a nested field type. Every dose needed to be converted at index time to a common unit per ingredient as well. Here is what a nested dose field mapping looks like:
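A hedged sketch of such a mapping, written as a Python dictionary that could be passed as the body of an index creation request. The field names `doses`, `ingredient_id`, and `amount` are hypothetical.

```python
# Hypothetical index mapping: each product holds a list of dose objects.
DOSE_MAPPING = {
    "mappings": {
        "properties": {
            "doses": {
                # "nested" indexes every dose as its own hidden document, so an
                # ingredient ID is only ever compared with its own amount
                "type": "nested",
                "properties": {
                    "ingredient_id": {"type": "keyword"},
                    # Amount already converted to the ingredient's common unit
                    "amount": {"type": "double"},
                },
            }
        }
    }
}
```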

And here is what the indexed data looks like:
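As an illustration only, the following sketch converts doses to a common unit and builds a document body. The conversion table, the field names, the product name, and the choice of milligrams for every ingredient are simplifying assumptions; the text above only says that each ingredient has its own common unit.

```python
# Hypothetical conversion factors to milligrams
UNIT_TO_MG = {"g": 1000.0, "mg": 1.0, "mcg": 0.001}


def to_common_unit(amount, unit):
    """Convert a dose to milligrams so that doses are directly comparable."""
    return amount * UNIT_TO_MG[unit]


def build_dose_entries(raw_doses):
    """Turn (ingredient_id, amount, unit) tuples into nested dose objects."""
    return [
        {"ingredient_id": str(ingredient_id), "amount": to_common_unit(amount, unit)}
        for ingredient_id, amount, unit in raw_doses
    ]


# Made-up product with two dosed ingredients
document = {
    "name": "Example Product",
    "doses": build_dose_entries([(101, 1, "g"), (205, 500, "mcg")]),
}
```

In this example the 1 g dose is stored as 1000 mg and the 500 mcg dose as roughly 0.5 mg, so the decay functions always compare like with like.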

Performance Considerations

The Similar Products Query has a lot going on, and with it, we are asking Elasticsearch to do more than we normally do with a search engine query, especially when a source product has a large number of ingredients. With that said, the Similar Products Query is actually quite fast and does return results in a reasonable amount of time. Even so, we were concerned that a large request to our Elasticsearch cluster every time a user navigates to a product page would be a lot for it to handle on top of searching and indexing requests, and that everything in our app that uses Elasticsearch would suffer because of that.

Luckily for us, the results of the Similar Products Query for a given source product will not change very often. We decided to cache the IDs of the results for 24 hours with a product-specific key in our Redis cache in order to limit the number of these requests going to Elasticsearch. This speeds up our average response time on product pages and reduces the load on our Elasticsearch and app servers, which is important as we grow and scale.
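The caching pattern can be sketched like this. The key format and function names are hypothetical, and `InMemoryCache` is a stand-in that mimics only the `get` and `setex` calls of a Redis client so that the example runs without a server.

```python
import json
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours


class InMemoryCache:
    """Illustrative stand-in exposing the get/setex subset of a Redis client."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires_at = self._store.get(key, (None, 0.0))
        if value is None or time.monotonic() >= expires_at:
            return None
        return value

    def setex(self, key, ttl_seconds, value):
        self._store[key] = (value, time.monotonic() + ttl_seconds)


def cached_similar_product_ids(cache, product_id, run_query):
    """Return cached result IDs, calling run_query only on a cache miss."""
    key = f"similar_products:{product_id}"  # hypothetical product-specific key
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    ids = run_query(product_id)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(ids))
    return ids
```

Only the result IDs are cached, so the stored values stay small and the product records themselves can still be loaded fresh when the page is rendered.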

Next Steps

Our first iteration on “The Similar Products Query” was a success. After a couple of rounds of testing, tuning boost values, and verifying result sets with some of our teammates on IMAT, we got an enthusiastic green light to release this project. Take a look at a few examples in our open catalog to see our Similar Products Query in action:

Fullscript | Cortisol Manager (us.fullscript.com): The ingredients in Cortisol Manager support healthy cortisol levels.* Cortisol is one of the most important hormones…

Fullscript | Active B-Complex (us.fullscript.com): Integrative Therapeutics ACTIVE B-COMPLEX B-VITAMINS PLAY ESSENTIAL ROLES IN MANY FUNCTIONS OF THE BODY Optimal folate…

Fullscript | Liposomal Vitamin C (us.fullscript.com): Dr. Mercola Premium Products Vitamin C is one of the most powerful and versatile of all the essential nutrients. It…

With that said, this first iteration was more of a minimum viable product. Even though the results met expectations, there is still room for improvement. Here are a few things that we plan on doing in the future to make our similar product recommendations even better:

Verify that our structured ingredient data is complete and accurate, ideally through automation
Create a subquery to find products with a similar price per dose
Create a subquery to find products targeting the same demographic

We also have products in our catalog that do not have ingredients, like wipes, for which the Similar Products Query will not work. These products will also need a way to retrieve similar product recommendations.

Conclusion

Elasticsearch does a lot of things very well, and powering content-based recommendation systems is just another thing we can add to that list. We really enjoyed building this project and learning about all the features Elasticsearch has to offer that helped us with our final product, especially the decay function and script_score. Now that this project is released, there is no need for IMAT to manually enter similar product lists, and they can focus on other important work.

Thank you for reading this post! If you are building a content-based recommendation system, I highly recommend considering Elasticsearch and the methods described above.
