
Developing a Modern Search Stack: Hybrid Search

Authors

Dave Currie,
Lucas Lins



In part 4 of this blog series, we explained how we developed an NER (Named-Entity Recognition) service to help make our search results more precise. This was a big improvement, as it removed many irrelevant products from our search results. However, some relevant products were still not being returned to our users (a “recall” issue). Since our search stack at the time used only an Elasticsearch query to retrieve products, we were limited to exact and fuzzy string matching (lexical search). Although this works well in many cases, our product catalog consists of supplements that can help people with many health conditions, e.g. inflammation, sleep, weight management. Because users can search for a condition with many different words (sleepy, tired, fatigued, exhausted, etc.), relevant products might not be returned if a product’s text does not contain a matching token. Although we included common synonyms to help with this recall issue, the true solution required us to return products that matched the user’s intent.

Semantic Search

Unlike lexical search, which is limited to comparing the tokens shared between user queries and our products, semantic search compares the two based on their shared meaning. This comparison is often done using vector models, which represent texts as points in a multidimensional space: the closer two texts are in that space, the more similar their meanings. This “shared meaning” is often measured by cosine similarity.
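As an illustration, cosine similarity between two embedding vectors can be computed as follows. The three-dimensional vectors here are toy values; real embedding models produce hundreds of dimensions.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 means very similar direction (meaning)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" for illustration only.
sleep_aid = np.array([0.9, 0.1, 0.0])
melatonin = np.array([0.8, 0.2, 0.1])
protein   = np.array([0.0, 0.1, 0.9])

print(cosine_similarity(sleep_aid, melatonin))  # high: related meanings
print(cosine_similarity(sleep_aid, protein))    # low: unrelated meanings
```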

Model Selection

After deciding to adopt semantic search, our next challenge was to choose the right model. We evaluated a wide range of pre-trained options, including several SBERT models and others available through Hugging Face, like BioBERT, PubMedBERT, and other clinical-domain variants. Our main goal was simple in theory but difficult in practice: find a model that achieved strong recall, generated embeddings quickly, and was affordable.

Anyone with experience in vector search knows that this balance is not easy to achieve. High-quality embeddings often come from large, slower models, while faster, smaller models typically lose accuracy. Finding the right compromise required experimentation.

Data Selection

To evaluate our candidates, we used our judgment list to build the datasets for training, evaluation, and testing. Our focus was on maximizing Recall@K, considering only results that were labeled as relevant in the judgment list.
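As a sketch, Recall@K for a single query can be computed like this (the product IDs are hypothetical):

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the query's relevant items found in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)

# Hypothetical judgment-list labels for one query.
relevant = {"prod_a", "prod_b", "prod_c"}
retrieved = ["prod_a", "prod_x", "prod_b", "prod_y", "prod_z"]
print(recall_at_k(retrieved, relevant, k=5))  # 2 of 3 relevant found ≈ 0.67
```

In practice this is averaged over all queries in the evaluation set.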

To avoid data leakage, we split our data by query rather than randomly, ensuring that no query used for training appeared in the testing set. This made our evaluation more reliable and closer to real-world performance.
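A minimal sketch of such a query-level split, assuming rows of (query, product, label) tuples; the example queries and products are made up:

```python
import random

def split_by_query(pairs, test_fraction=0.2, seed=42):
    """Assign whole queries to train or test so no query
    used for training appears in the test set."""
    queries = sorted({q for q, _, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(queries)
    n_test = int(len(queries) * test_fraction)
    test_queries = set(queries[:n_test])
    train = [p for p in pairs if p[0] not in test_queries]
    test = [p for p in pairs if p[0] in test_queries]
    return train, test

pairs = [("sleep", "melatonin", 1), ("sleep", "whey protein", 0),
         ("energy", "b12", 1), ("stress", "ashwagandha", 1),
         ("stress", "vitamin c", 0)]
train, test = split_by_query(pairs, test_fraction=0.4)
assert not ({q for q, _, _ in train} & {q for q, _, _ in test})  # no leakage
```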

Model Testing

We tested a variety of models, from lightweight options such as MiniLM to embeddings generated by large language models. Our goal was to understand how each performed for our specific use case.

The plot below shows a subset of our testing results. While smaller models performed well for speed and cost, they could struggle with nuanced domain terms, such as ingredient and condition pairings. Larger models performed better on these queries but required more resources.

After several rounds of testing, we realized we had enough labeled data and internal expertise to fine-tune our own model. This gave us an opportunity to address the weaknesses of off-the-shelf models and better align the embeddings with our product catalog and domain language. This is particularly relevant for a use case like ours as the vocabulary of our products’ text (e.g. health conditions, medicinal ingredients) is not representative of the training data for most pretrained models. By fine-tuning models, we help them to understand the nuances of our data.

Model Training

We already had the core of our training data from the data selection process, but one key ingredient was still missing: hard negatives. These are examples that are close to being relevant but are not positive labels. They help the model learn sharper distinctions. SBERT provides a built-in hard negative mining utility, which can help to simplify this process.
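Recent sentence-transformers releases include a built-in `mine_hard_negatives` utility; the simplified NumPy sketch below only illustrates the underlying idea: among products not labeled positive, pick the ones whose embeddings are closest to the query.

```python
import numpy as np

def mine_hard_negatives(query_emb, product_embs, positive_ids, n=2):
    """Return the n non-positive products most similar to the query:
    close enough to be confusable, but labeled negative."""
    # Normalize so the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    P = product_embs / np.linalg.norm(product_embs, axis=1, keepdims=True)
    sims = P @ q
    order = np.argsort(-sims)  # most similar first
    return [int(i) for i in order if i not in positive_ids][:n]

# Toy 2-d embeddings: product 0 is the positive; 1 and 3 are near-misses.
query = np.array([1.0, 0.0])
products = np.array([[1.0, 0.0], [0.9, 0.4], [0.1, 1.0], [0.7, 0.7]])
print(mine_hard_negatives(query, products, positive_ids={0}))
```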

Once we gathered both the positive and negative samples, the next step was selecting an appropriate loss function. We experimented with several alternatives but ultimately focused on MultipleNegativesRankingLoss (also known as InfoNCE), given the nature of our task and the availability of well-mined negatives. This loss function was particularly well-suited to leverage in-batch negatives efficiently while reinforcing fine-grained ranking distinctions.
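A minimal NumPy sketch of MultipleNegativesRankingLoss / InfoNCE with in-batch negatives; in practice the sentence-transformers implementation handles this during training, so this is only to illustrate the math:

```python
import numpy as np

def info_nce_loss(query_embs, pos_embs, scale=20.0):
    """For each query i, pos_embs[i] is its positive and every other
    pos_embs[j] in the batch serves as a negative. The loss is
    cross-entropy over scaled cosine similarities."""
    Q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    P = pos_embs / np.linalg.norm(pos_embs, axis=1, keepdims=True)
    logits = scale * (Q @ P.T)                    # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The "correct" product for each query sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))

aligned = info_nce_loss(np.eye(3), np.eye(3))          # perfect matches
misaligned = info_nce_loss(np.eye(3), np.eye(3)[::-1])  # shuffled positives
print(aligned, misaligned)  # aligned loss is near zero, misaligned is large
```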

For training data construction, positive labels were derived from our internal judgment list, which was built from user interaction signals. Negative samples were composed of a mix of randomly sampled products and hard negatives mined using SBERT’s built-in hard negative mining utility. This combination ensured both broad contrastive coverage (via random negatives) and sharper boundary learning (via hard negatives that were semantically close but incorrect).

Product text representations were created by concatenating multiple structured product fields, including name, brand, description, and other relevant attributes. Each field underwent its own preprocessing before concatenation to ensure consistency and reduce noise. These texts served as the model’s input for embedding generation.
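The concatenation step might look like the following sketch; the field names, separator, and preprocessing are illustrative, not our exact pipeline:

```python
def build_product_text(product: dict) -> str:
    """Concatenate preprocessed product fields into one embedding input."""
    def clean(value: str) -> str:
        # Collapse whitespace and lowercase to reduce noise.
        return " ".join(value.lower().split())

    parts = [clean(product.get(field, ""))
             for field in ("name", "brand", "description")]
    return " | ".join(p for p in parts if p)

product = {"name": "Magnesium  Glycinate", "brand": "Acme",
           "description": "Supports relaxation and sleep quality."}
print(build_product_text(product))
# magnesium glycinate | acme | supports relaxation and sleep quality.
```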

We trained the model for approximately 10 epochs, using early stopping to prevent overfitting. The number of epochs was chosen to allow the model sufficient time to complete the optimization process while monitoring performance to stop training once improvements plateaued.
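Early stopping on a validation metric can be sketched as follows; the callbacks and scores here are hypothetical stand-ins for a real training step and evaluation:

```python
def train_with_early_stopping(train_epoch, evaluate, max_epochs=10, patience=2):
    """Run up to max_epochs, stopping once the validation metric
    fails to improve for `patience` consecutive epochs."""
    best_score, best_epoch, stale = float("-inf"), 0, 0
    for epoch in range(1, max_epochs + 1):
        train_epoch(epoch)
        score = evaluate()
        if score > best_score:
            best_score, best_epoch, stale = score, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_epoch, best_score

# Simulated validation scores that plateau after epoch 3.
scores = iter([0.60, 0.70, 0.72, 0.71, 0.71])
epoch, best = train_with_early_stopping(
    train_epoch=lambda e: None,     # stand-in for a real training step
    evaluate=lambda: next(scores),
    max_epochs=10, patience=2)
print(epoch, best)  # best checkpoint was epoch 3 with score 0.72
```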

This approach allowed our model to learn fine-grained differences between similar products, improving the recall of relevant results without sacrificing embedding quality or inference speed.

Inference

We deployed our embedding model into Metarank, which allowed us to generate real-time embeddings for user queries during search. Product embeddings were precomputed and stored in an HNSW index using Elasticsearch's native implementation. This allowed us to quickly return semantic search results, each carrying a cosine similarity score that describes how well the product matches the user's query.
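For reference, a dense_vector mapping and approximate kNN query along these lines; the field name, dimensionality, and vector values are illustrative, and recent Elasticsearch versions build an HNSW graph for dense_vector fields indexed this way:

```python
# Index mapping: an indexed dense_vector field with cosine similarity.
mapping = {
    "properties": {
        "embedding": {
            "type": "dense_vector",
            "dims": 384,
            "index": True,
            "similarity": "cosine",
        }
    }
}

# Approximate kNN query: the query vector is produced at request time
# by embedding the user's query, then the nearest products are retrieved.
knn_query = {
    "knn": {
        "field": "embedding",
        "query_vector": [0.12, -0.03, 0.44],  # truncated for illustration
        "k": 10,
        "num_candidates": 100,
    }
}
```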

It’s worth noting that not all search queries fetch semantic results. Certain user queries, such as those for ingredients of a product, only fetch lexical matches, since there’s nothing to be gained from semantic search. However, we still calculate the cosine similarity scores between the user’s query and the returned products, because it’s helpful for our learning to rank models to always have that feature available. By avoiding sparsity (i.e. null values) in the training data, the cosine similarity feature becomes more useful. It also encodes valuable information, as it partially reflects a product’s popularity for a query, since popular products are more often the positive label for that query.

Results & What’s Next

Hybrid search dramatically improved our results for semantically complex queries and for queries containing certain typos. Both our recall and NDCG metrics made notable gains, and users had fewer complaints about needing to type the exact words to find their intended products.

Looking ahead, we’ll be building a fully automated retraining pipeline to keep our model fresh and our results sharp.

