Hybrid Retrieval and Semantic Search in RAG: Building Smarter Document Search Systems
Introduction: Why Your RAG System Is Failing at Retrieval
You've built a Retrieval-Augmented Generation (RAG) system. You picked a good LLM, fine-tuned the prompts, optimized the context window. But something's still off—your system gives confident-sounding answers that are subtly wrong, or it misses relevant information entirely.
Here's the uncomfortable truth: your retriever is the bottleneck, not your generator.
Most developers focus optimization efforts on the language model—tweaking prompts, experimenting with different model sizes, adjusting temperature and top-k parameters. But research consistently shows that a mediocre retriever paired with a powerful LLM will generally lose to an excellent retriever paired with a smaller LLM. If the right documents never reach the generator's context window, no amount of model sophistication can save you.
The core problem with traditional RAG is oversimplification: most systems rely on a single retrieval method—either keyword search (BM25) or dense vector similarity (embeddings). Each approach has fundamental blind spots:
- Keyword search nails exact term matches but fails when users ask the same question using different vocabulary
- Vector search captures semantic relationships beautifully but can hallucinate relevance for documents that merely sound related
- Neither method alone handles the full complexity of real-world information retrieval
This is where hybrid retrieval enters the picture. By intelligently combining multiple retrieval strategies, you can overcome the limitations of any single approach and dramatically improve both retrieval accuracy and downstream answer quality.
This article walks you through the complete landscape of hybrid retrieval and semantic search techniques that will make your RAG system actually work.
Part 1: The Three Core Retrieval Indices
Before we blend approaches, let's understand what we're blending. There are three fundamental retrieval paradigms, each with distinct strengths and weaknesses.
BM25: The Keyword Foundation
BM25 (Best Matching 25) is the industry standard for lexical search. It's been around since 1994, and it works because it elegantly captures how documents relate to queries at the term level.
How BM25 works:
```python
import math

# Simplified BM25 scoring
def bm25_score(query_terms, document, avg_doc_length, total_docs,
               docs_containing, k1=1.5, b=0.75):
    """
    Calculate the BM25 score for a document given query terms.

    Args:
        query_terms: List of query words
        document: Dict with 'term_freq' (term -> count) and 'length'
        avg_doc_length: Average document length in the corpus
        total_docs: Total number of documents in the corpus
        docs_containing: Function mapping a term to its document frequency
        k1, b: Tuning parameters (standard values shown)
    """
    score = 0.0
    doc_length = document['length']
    for term in query_terms:
        # Inverse document frequency: penalizes common terms
        df = docs_containing(term)
        idf = math.log((total_docs - df + 0.5) / (df + 0.5) + 1)

        # Term frequency with length normalization
        term_freq = document['term_freq'].get(term, 0)
        numerator = term_freq * (k1 + 1)
        denominator = term_freq + k1 * (1 - b + b * (doc_length / avg_doc_length))
        score += idf * (numerator / denominator)
    return score
```
Why BM25 excels:
- Exact keyword matching with sophisticated term weighting
- Length normalization prevents bias toward longer documents
- Works with zero training—just index and search
- Computationally efficient (milliseconds for large corpora)
- Transparent: you can understand why a document ranked highly
Why BM25 struggles:
- Completely misses semantic relationships (synonyms get zero credit)
- Query expansion required for coverage ("car" won't find "vehicle")
- One typo kills matching
- Rare technical terms get over-weighted despite low semantic relevance
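To see these mechanics end to end, here is a self-contained toy scorer (pure standard library, illustrative three-document corpus) that ranks documents with the same formula:

```python
import math
from collections import Counter

def bm25_rank(query, corpus, k1=1.5, b=0.75):
    """Rank every document in `corpus` against `query` with BM25."""
    docs = [doc.lower().split() for doc in corpus]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n

    # Document frequency: how many documents contain each term
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if df[term] == 0:
                continue  # term absent from the corpus contributes nothing
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            f = tf[term]
            score += idf * (f * (k1 + 1)) / (f + k1 * (1 - b + b * len(d) / avg_len))
        scores.append(score)
    return sorted(zip(corpus, scores), key=lambda x: x[1], reverse=True)

corpus = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "stock prices fell sharply today",
]
ranked = bm25_rank("cat mat", corpus)
```

Note the lexical blind spot in action: the second document mentions "cats", but BM25 gives it zero credit for the query term "cat".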
Dense Vector Search (KNN): The Semantic Approach
Dense vectors represent documents and queries as points in high-dimensional embedding space. Documents with similar meanings cluster together mathematically.
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Initialize a pretrained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode documents and query into dense vectors
documents = [
    "The cat sat on the mat",
    "A feline rested on the carpet",
    "The stock market crashed today"
]
query = "Where did the cat rest?"

# Get embeddings (384-dimensional vectors)
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Calculate cosine similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Rank by similarity
ranked = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}: {doc}")
```
Output:
0.847: A feline rested on the carpet
0.823: The cat sat on the mat
0.124: The stock market crashed today
Why dense search excels:
- Captures semantic meaning independent of exact vocabulary
- Handles synonyms, paraphrases, and conceptual variations naturally
- Works across languages (with multilingual models)
- Generalizes to out-of-vocabulary terms
- State-of-the-art performance on many benchmarks
Why dense search struggles:
- Computationally expensive at scale: exact search compares the query against every document, so large corpora require approximate nearest-neighbor (ANN) indexing
- Prone to semantic drift: similar-sounding documents about unrelated topics
- Opaque—you can't easily explain which terms contributed to a match
- Requires significant vector storage (a 1024-dimensional float32 embedding is 4 KB per document chunk)
- Embedding quality depends entirely on training data; domain-specific embeddings often outperform general ones
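The cost issue is easy to see in code. Below is a minimal brute-force cosine top-k in NumPy; the function name and toy vectors are illustrative. Production systems replace the linear scan with an ANN index (e.g. HNSW, as in FAISS or Elasticsearch) once the corpus grows:

```python
import numpy as np

def top_k_cosine(query_vec, doc_matrix, k=3):
    """Exact top-k cosine search: normalize, dot-product, sort.
    This scan is O(N) per query over the whole corpus, which is
    exactly why large deployments need ANN indexing instead."""
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    sims = docs @ q
    order = np.argsort(-sims)[:k]
    return order, sims[order]

doc_vectors = np.array([
    [1.0, 0.0, 0.0],  # on-topic document
    [0.0, 1.0, 0.0],  # unrelated document
    [0.9, 0.1, 0.0],  # near-duplicate of the query topic
])
idx, scores = top_k_cosine(np.array([1.0, 0.0, 0.0]), doc_vectors, k=2)
```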
Sparse Encoder Search (ELSER): The Semantic-Keyword Bridge
Sparse encoders like Elasticsearch's ELSER represent a fascinating middle ground. They're trained to expand query terms contextually while maintaining interpretability.
Think of ELSER as teaching the system that when someone searches for "car," documents mentioning "vehicle," "automobile," and "transportation" are semantically relevant—but instead of representing this as an opaque 384-dimensional vector, it expands the query itself.
```python
# Conceptual example of sparse encoding
# (ELSER does this internally; here's the idea)
def sparse_encode_query(query_text, expansion_map):
    """Expand a query based on learned semantic relationships."""
    original_terms = query_text.lower().split()
    expanded = set(original_terms)
    for term in original_terms:
        if term in expansion_map:
            expanded.update(expansion_map[term])
    return expanded

# ELSER learns expansion patterns like these from training data
expansion_map = {
    'cars': ['vehicle', 'automobile', 'motor'],
    'effective': ['efficient', 'productive', 'successful'],
}

# Query: "effective cars"
# Becomes: {"effective", "cars", "efficient", "productive",
#           "successful", "vehicle", "automobile", "motor"}
```
Why sparse encoders excel:
- Combines keyword matching precision with semantic understanding
- Interpretable: you can see which expanded terms matched
- Computationally efficient (sparse operations, not dense vectors)
- Works well with existing keyword-based infrastructure (Elasticsearch, Solr)
- Specialized dense-to-sparse training improves semantic matching
Why sparse encoders struggle:
- Less mature technology than BM25 or dense vectors
- Requires specialized infrastructure (Elasticsearch 8.8+ with a machine learning node for ELSER)
- Performance depends on expansion quality from training
- Not yet as universally adopted as other methods
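For reference, querying an ELSER-backed index uses Elasticsearch's `text_expansion` query. A minimal sketch, assuming the expanded tokens were indexed into a field named `content_expansion` (the field name depends on your ingest pipeline; `.elser_model_2` is Elastic's packaged model ID):

```python
# Hedged sketch: the request body an ELSER search sends to Elasticsearch.
# ELSER expands "effective cars" into weighted tokens server-side and
# matches them against the sparse tokens stored at index time.
elser_query = {
    "query": {
        "text_expansion": {
            "content_expansion": {          # assumed sparse field name
                "model_id": ".elser_model_2",
                "model_text": "effective cars",
            }
        }
    }
}
```

You would pass this body to `es.search(index=..., body=elser_query)` like any other query, which is what makes sparse encoders easy to blend with BM25 in the same cluster.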
Part 2: Blended RAG Architecture in Action
Now that we understand each retrieval method's strengths, let's see why combining them dramatically improves results.
The Research Evidence: Concrete Performance Gains
Sawarkar et al. tested this empirically with their Blended RAG framework. Here's what they found:
The key architectural insight is multiple parallel retrieval pathways. Rather than choosing one retrieval method, Blended RAG runs them simultaneously and intelligently combines their results.
Let's look at the concrete numbers:
| Model/Pipeline | EM | F1 | Top-5 | Top-20 |
| --- | --- | --- | --- | --- |
| RAG-original | 28.12 | 39.42 | 59.64 | 72.38 |
| RAG-end2end | 40.02 | 52.63 | 75.79 | 85.57 |
| Blended RAG | 57.63 | 68.4 | 94.89 | 98.58 |
Source: "Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers"
That's a roughly 44% relative improvement in exact match (EM) over the prior pipeline (40.02 → 57.63). Not a marginal optimization, but a fundamental transformation.
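Combining the parallel result lists is typically done with Reciprocal Rank Fusion (RRF), which rewards documents that rank well in several lists without having to calibrate raw scores across retrievers. A minimal sketch with made-up document IDs:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of document IDs with RRF.
    Each document's fused score is the sum of 1 / (k + rank) over
    every list it appears in (rank is 1-based; k=60 is the
    conventional smoothing constant)."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Illustrative ranked lists from two retrievers
bm25_hits = ["d3", "d1", "d7"]
dense_hits = ["d3", "d9", "d1"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# "d3" wins: it ranked first in both lists
```

Note how "d1" outranks "d9" even though "d9" placed higher in the dense list: appearing in both lists, even at lower ranks, beats a single strong placement.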
Multi-Match Query Types: The Technical Foundation
The secret sauce is using different query formulations against the same indices. In Elasticsearch terminology, these are called "multi_match" query types. Each treats term relationships differently:
1. Best Fields (Optimal for ELSER)
```python
# Best fields: score each field independently and take the best
# Useful when a term appearing fully in one field beats being spread across fields
elastic_query = {
    "multi_match": {
        "query": "machine learning fundamentals",
        "type": "best_fields",
        "fields": ["title^2", "content", "summary"],  # title weighted 2x
        "operator": "or"  # documents matching ANY term are ranked
    }
}
# A document with "machine learning fundamentals" in the title gets the highest score
# Score: the best match from any single field wins
```
2. Most Fields (Balanced approach)
```python
# Most fields: appearance across different fields boosts relevance
# If "learning" appears in title AND content, it's weighted higher
elastic_query = {
    "multi_match": {
        "query": "machine learning",
        "type": "most_fields",
        "fields": ["title^2", "content", "metadata"],
        "operator": "and"  # documents matching ALL terms rank higher
    }
}
# A document with "machine" in the title + "learning" in the content scores high
```
3. Cross Fields
```python
# Cross fields: treats all fields as one big field when calculating term frequency
elastic_query = {
    "multi_match": {
        "query": "author Chollet",
        "type": "cross_fields",
        "fields": ["author^2", "content"]
    }
}
# Useful for author-book queries where term relevance spans fields
```
4. Phrase Prefix
```python
# Phrase prefix: matches the phrase, with prefix completion on the last term
elastic_query = {
    "multi_match": {
        "query": "deep learning optim",
        "type": "phrase_prefix",
        "fields": ["title", "content"]
    }
}
# Matches: "deep learning optimization" (phrase with prefix completion)
# Not: "optimization of deep learning" (wrong order)
```
The breakthrough insight: different indices perform better with different query types. BM25 performs optimally with best_fields, while dense search prefers most_fields, and sparse encoders excel with best_fields formulations.
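One way to encode this pairing is a simple dispatch table. The mapping below reflects the pairings described above; the helper name and field list are illustrative:

```python
# Map each retriever to the multi_match type it performs best with
QUERY_TYPE_FOR_INDEX = {
    "bm25": "best_fields",
    "dense": "most_fields",
    "sparse": "best_fields",
}

def build_multi_match(query, retriever, fields=("title^2", "content")):
    """Build a multi_match query body tuned to the given retriever."""
    return {
        "multi_match": {
            "query": query,
            "type": QUERY_TYPE_FOR_INDEX[retriever],
            "fields": list(fields),
        }
    }
```

This keeps the per-retriever tuning in one place, so adding a fourth retrieval pathway only means adding one dictionary entry.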
Practical Implementation: Building Your Hybrid Retriever
Let's build a working hybrid retrieval system using Python and Elasticsearch:
```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
from sentence_transformers import SentenceTransformer


class HybridRetriever:
    def __init__(self, es_host="http://localhost:9200"):
        self.es = Elasticsearch(es_host)
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.index_name = "hybrid_documents"

    def create_index(self):
        """Create an Elasticsearch index supporting all three retrieval types."""
        index_config = {
            "settings": {
                "number_of_shards": 1,
                "number_of_replicas": 0,
                "analysis": {
                    "analyzer": {
                        "default": {"type": "standard", "stopwords": "_english_"}
                    }
                }
            },
            "mappings": {
                "properties": {
                    "id": {"type": "keyword"},
                    "title": {"type": "text", "analyzer": "default"},
                    "content": {"type": "text", "analyzer": "default"},
                    # Dense embedding for KNN search
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 384,
                        "index": True,
                        "similarity": "cosine"
                    },
                    # Sparse vector for ELSER-style search
                    "elser_embedding": {"type": "sparse_vector"}
                }
            }
        }
        if self.es.indices.exists(index=self.index_name):
            self.es.indices.delete(index=self.index_name)
        self.es.indices.create(index=self.index_name, body=index_config)

    def index_documents(self, documents):
        """Index documents with their dense embeddings."""
        actions = []
        for doc in documents:
            # Generate the dense embedding client-side
            embedding = self.encoder.encode(doc['content']).tolist()
            actions.append({
                "_index": self.index_name,
                "_id": doc['id'],
                "_source": {
                    "id": doc['id'],
                    "title": doc['title'],
                    "content": doc['content'],
                    "embedding": embedding,
                    # In production, the ELSER embedding is generated
                    # server-side by an ingest pipeline
                }
            })
        bulk(self.es, actions)
        self.es.indices.refresh(index=self.index_name)

    def hybrid_search(self, query, k=10):
        """
        Execute BM25 and dense searches in parallel and combine
        their ranked lists using Reciprocal Rank Fusion.
        """
        # 1. BM25 search (keyword-based)
        bm25_query = {
            "multi_match": {
                "query": query,
                "type": "best_fields",
                "fields": ["title^2", "content"],
                "operator": "or"
            }
        }
        bm25_results = self.es.search(
            index=self.index_name, query=bm25_query, size=k
        )

        # 2. Dense vector search (semantic)
        query_embedding = self.encoder.encode(query).tolist()
        dense_results = self.es.search(
            index=self.index_name,
            knn={
                "field": "embedding",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": k * 3
            },
            size=k
        )

        # 3. Fuse the ranked lists with Reciprocal Rank Fusion (k=60)
        fused = {}
        for results in (bm25_results, dense_results):
            for rank, hit in enumerate(results["hits"]["hits"], start=1):
                doc_id = hit["_id"]
                fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (60 + rank)
        ranked = sorted(fused.items(), key=lambda x: x[1], reverse=True)
        return ranked[:k]
```