Many teams try to fix weak answers by switching models. In production systems, the bigger issue is usually retrieval quality. If the context sent into the model is incomplete, stale, or loosely related, even the best model cannot recover.
The Core Failure Pattern
A user asks about a specific person, policy code, or document identifier. Pure vector search returns semantically related text but misses the exact chunk carrying the answer. The response looks confident yet drifts from the source.
Hybrid Retrieval Beats Pure Vector
Hybrid retrieval combines dense similarity with lexical matching. Dense vectors capture intent, while keyword signals preserve exactness for names, IDs, and compliance language.
async def hybrid_search(query, top_k=10):
    # Lexical terms preserve exact matches for names, IDs, and codes.
    keywords = extract_keywords(query)
    # Over-fetch dense results so the fusion step has candidates to work with.
    vector_hits = await search_dense(query, limit=top_k * 2)
    lexical_hits = await search_lexical(keywords, limit=top_k)
    # Fuse both result lists, then keep the best top_k.
    return rerank(vector_hits, lexical_hits)[:top_k]
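The rerank step is where fusion happens. One common choice is reciprocal rank fusion (RRF); a minimal self-contained sketch, assuming each hit list is ordered best-first and entries are document IDs:

```python
def rerank(vector_hits, lexical_hits, k=60):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank),
    # so documents ranked highly in either list float to the top.
    scores = {}
    for hits in (vector_hits, lexical_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates score from each, which is exactly the behavior hybrid retrieval wants: lexical and semantic agreement outranks either signal alone.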
Chunking Strategy Is a Quality Lever
Fixed-size chunks often cut across logical boundaries. Split on sentence or paragraph boundaries instead, and add overlap between adjacent chunks so related facts remain retrievable when a query lands near a chunk edge.
def chunk_text(text, max_chars=900, overlap=220):
    # Preserve sentence boundaries and retain overlap between chunks
    ...
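A minimal runnable sketch of that stub, using a naive regex sentence split and carrying a character-wise tail of each chunk forward as overlap (the regex and the tail-based overlap are illustrative simplifications, not a production tokenizer):

```python
import re

def chunk_text(text, max_chars=900, overlap=220):
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            # Seed the next chunk with the tail of this one as overlap.
            current = current[-overlap:]
        current = (current + " " + sentence).strip() if current else sentence
    if current:
        chunks.append(current)
    return chunks
```

Note that a chunk can slightly exceed max_chars when the overlap tail plus the next sentence overflows; a production version would also handle single sentences longer than max_chars.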
Rewrite Ambiguous Queries
Queries like "what about that regulation" lack searchable detail. Rewriting with conversation context creates clearer retrieval targets, which improves both recall and rank position.
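One lightweight approach is to have the generation model rewrite the query before retrieval runs. A sketch of the prompt construction, assuming recent turns are available as a list of strings and a hypothetical llm_complete(prompt) helper produces the rewrite:

```python
def build_rewrite_prompt(query, history, max_turns=4):
    # Include only the most recent turns so the prompt stays small.
    recent = "\n".join(history[-max_turns:])
    return (
        "Rewrite the user's question as a standalone search query.\n"
        "Resolve pronouns and vague references using the conversation.\n\n"
        f"Conversation:\n{recent}\n\n"
        f"Question: {query}\n"
        "Standalone query:"
    )
```

The model's completion then replaces the raw query for both dense and lexical retrieval, giving each retriever concrete terms to match.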
Measure Before You Tune
Track retrieval metrics such as Recall@K and MRR on a fixed evaluation set. This gives objective feedback when adjusting chunk size, reranking, filters, or fusion parameters.
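Both metrics take only a few lines. A minimal sketch, where retrieved is a ranked list of document IDs for one query and relevant is the set of gold IDs:

```python
def recall_at_k(retrieved, relevant, k):
    # Fraction of gold documents that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(retrieved, relevant):
    # 1 / rank of the first relevant result; 0 if none is retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mrr(runs):
    # Mean reciprocal rank over (retrieved, relevant) evaluation pairs.
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```

Run these over the same fixed evaluation set before and after each change; a tuning step that raises Recall@K but drops MRR means the answer is still retrieved, just ranked lower.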
Takeaway
Retrieval architecture is the performance foundation of RAG. Strong model output starts with precise, high-relevance context. Improve search first, then optimize generation.