RAG Retrieval Architecture

Why the bottleneck is usually your retrieval layer, not your LLM.

Many teams try to fix weak answers by switching models. In production systems, the bigger issue is usually retrieval quality. If the context sent into the model is incomplete, stale, or loosely related, even the best model cannot recover.

Better retrieval raises the answer ceiling for every model you use, while a model swap alone merely papers over the underlying search problems.

The Core Failure Pattern

A user asks for a specific person, policy code, or document identifier. Pure vector search returns semantically related text, but misses the exact chunk carrying the answer. The response looks confident but drifts from the source.

Hybrid Retrieval Beats Pure Vector

Hybrid retrieval combines dense similarity with lexical matching. Dense vectors capture intent, while keyword signals preserve exactness for names, IDs, and compliance language.

async def hybrid_search(query, top_k=10):
    # extract_keywords, search_dense, search_lexical, and rerank
    # are application-specific helpers.
    keywords = extract_keywords(query)
    # Over-fetch dense hits so the reranker has candidates to promote.
    vector_hits = await search_dense(query, limit=top_k * 2)
    lexical_hits = await search_lexical(keywords, limit=top_k)
    # Fuse both ranked lists, then keep the top_k overall.
    return rerank(vector_hits, lexical_hits)[:top_k]
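The rerank step above is left abstract; one common, training-free way to fuse the two result lists is Reciprocal Rank Fusion (RRF). Here is a minimal sketch, assuming the hits are ranked lists of document IDs (the function name and the conventional constant k=60 are illustrative):

```python
from collections import defaultdict

def rerank(vector_hits, lexical_hits, k=60):
    # Reciprocal Rank Fusion: each document scores 1 / (k + rank)
    # in every list it appears in; sum the scores and sort descending.
    scores = defaultdict(float)
    for hits in (vector_hits, lexical_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that appear in both lists get two score contributions, so exact keyword matches that also rank well semantically rise to the top.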

Chunking Strategy Is a Quality Lever

Fixed-size chunks often cut across logical boundaries. Keep sentence or paragraph boundaries and use overlap so linked facts remain available when queries land near edges.

import re

def chunk_text(text, max_chars=900, overlap=220):
    # Split on sentence boundaries, pack sentences up to max_chars,
    # and carry the last `overlap` characters into the next chunk.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = current[-overlap:]
        current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

Rewrite Ambiguous Queries

Queries like "what about that regulation" lack searchable detail. Rewriting with conversation context creates clearer retrieval targets, which improves both recall and rank position.

Measure Before You Tune

Track retrieval metrics such as Recall@K and MRR on a fixed evaluation set. This gives objective feedback when adjusting chunk size, reranking, filters, or fusion parameters.
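Both metrics are a few lines each. A minimal sketch, assuming each evaluation item pairs a ranked list of retrieved document IDs with the set of relevant IDs:

```python
def recall_at_k(retrieved, relevant, k=10):
    # Fraction of relevant documents that appear in the top-k results.
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def mrr(queries):
    # Mean Reciprocal Rank: average of 1/rank of the first relevant hit,
    # counting 0 for queries where nothing relevant is retrieved.
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(queries) if queries else 0.0
```

Recall@K tells you whether the answer is in the candidate set at all; MRR tells you how high it ranks, which matters when only the first few chunks fit the context window.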

Takeaway

Retrieval architecture is the performance foundation of RAG. Strong model output starts with precise, high-relevance context. Improve search first, then optimize generation.