
Building RAG Pipelines at Scale
A deep dive into production RAG pipelines — vector databases, chunking strategies, embedding models, and evaluation frameworks for reliable AI-powered retrieval.
Why RAG Matters for Production AI
Retrieval-Augmented Generation (RAG) has emerged as the most practical pattern for building AI applications that need access to private or up-to-date information. Unlike fine-tuning, RAG lets you ground LLM responses in your own data without retraining the model.
But building a RAG pipeline that works reliably in production is far more nuanced than the tutorials suggest. Here's what I've learned deploying RAG systems that serve thousands of queries daily.
The Architecture
A production RAG pipeline has five core stages:
- Document Ingestion — Loading and parsing documents from various sources
- Chunking — Splitting documents into semantically meaningful pieces
- Embedding — Converting chunks into dense vector representations
- Retrieval — Finding the most relevant chunks for a query
- Generation — Synthesizing an answer from retrieved context
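To make the five stages concrete, here is a minimal, dependency-free sketch of the whole flow. It is a toy under stated assumptions, not a production design: the bag-of-words `embed` function stands in for a real embedding model, `generate` only assembles a prompt instead of calling an LLM, and all function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def ingest(sources):
    """Stage 1: load raw text. Here the sources are already plain strings."""
    return [s.strip() for s in sources if s.strip()]


def chunk(doc, max_words=12):
    """Stage 2: split on sentence boundaries, then cap chunk length in words."""
    sentences = re.split(r"(?<=[.!?])\s+", doc)
    chunks = []
    for sentence in sentences:
        words = sentence.split()
        for i in range(0, len(words), max_words):
            chunks.append(" ".join(words[i:i + max_words]))
    return chunks


def embed(text):
    """Stage 3: toy bag-of-words vector (assumption: replaces a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Stage 4: rank indexed chunks by similarity to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def generate(query, context):
    """Stage 5: stand-in for the LLM call; it only builds the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {query}"


docs = ingest([
    "Pinecone stores dense vectors. It supports metadata filtering.",
    "BM25 is a sparse retrieval method based on term frequency.",
])
index = [(c, embed(c)) for d in docs for c in chunk(d)]
question = "Which method is sparse retrieval?"
top = retrieve(question, index, k=1)
prompt = generate(question, top)
print(prompt)
```

Each stage is a separate function on purpose: in a real system you swap one implementation at a time (a different splitter, a different embedding model) and re-run your evaluation.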
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Pinecone

# Chunking strategy matters more than most people think.
# Note: chunk_size and chunk_overlap are counted in characters by default
# (length_function=len), not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
)
# `documents` is the list of Document objects produced by the ingestion stage
chunks = splitter.split_documents(documents)

Chunking: The Most Underrated Decision
Most RAG tutorials use fixed-size chunks of 1000 characters and move on. In production, your chunking strategy directly impacts retrieval quality.
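To see what chunk size and overlap actually do, here is a small sketch of fixed-size chunking with a sliding window. It assumes whitespace-separated words as a stand-in for tokens (a real pipeline would count with the embedding model's tokenizer), and `sliding_window_chunks` is a hypothetical helper, not a library function.

```python
def sliding_window_chunks(text, chunk_size=8, overlap=2):
    """Split text into windows of `chunk_size` words, where consecutive
    windows share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(tokens):
            break
    return chunks


text = " ".join(f"w{i}" for i in range(20))
chunks = sliding_window_chunks(text, chunk_size=8, overlap=2)
for c in chunks:
    print(c)
```

With 20 words, a size of 8, and an overlap of 2, the window advances 6 words at a time, so a sentence that straddles a boundary still appears whole in at least one chunk as long as it is shorter than the overlap.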
What I've Found Works
- Smaller chunks (256-512 tokens) for factual Q&A — more precise retrieval
- Larger chunks (1024-2048 tokens) for summarization — more context per retrieval
- Semantic chunking for heterogeneous documents — respects natural boundaries
- Parent-child chunking for the best of both worlds — retrieve on small chunks, return parent context
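# Parent-child retrieval pattern
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Small child chunks are embedded and searched; their larger parents are returned as context
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# `vectorstore` is an already-initialized vector store (for example, the Pinecone index)
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

Embedding Model Selection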
The embedding model you choose is the second-most impactful decision. Here's my benchmark across three production datasets:
| Model | MTEB Score | Latency (ms) | Cost |
|---|---|---|---|
| text-embedding-3-large | 64.6 | 45 | $0.13/1M tokens |
| text-embedding-3-small | 62.3 | 30 | $0.02/1M tokens |
| voyage-large-2 | 65.1 | 50 | $0.12/1M tokens |
| bge-large-en-v1.5 | 63.4 | 15 (local) | Free |
For most production cases, I recommend text-embedding-3-small: going by the table, it reaches about 96% of the large model's MTEB score (62.3 vs. 64.6) at about 15% of the price ($0.02 vs. $0.13 per 1M tokens).
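The quality-versus-cost trade-off can be checked directly from the table. This sketch only divides the published figures; it treats MTEB score as a proxy for quality, which is an assumption, since your own retrieval benchmark may rank the models differently.

```python
# Figures copied from the benchmark table above
models = {
    "text-embedding-3-large": {"mteb": 64.6, "usd_per_1m_tokens": 0.13},
    "text-embedding-3-small": {"mteb": 62.3, "usd_per_1m_tokens": 0.02},
}

large = models["text-embedding-3-large"]
small = models["text-embedding-3-small"]

quality_ratio = small["mteb"] / large["mteb"]
cost_ratio = small["usd_per_1m_tokens"] / large["usd_per_1m_tokens"]

print(f"quality: {quality_ratio:.1%}, cost: {cost_ratio:.1%}")
# quality: 96.4%, cost: 15.4%
```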
Evaluation: The Missing Piece
The hardest part of RAG isn't building it — it's knowing if it works. Here's the evaluation framework I use:
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
    context_precision,
    context_recall,
)

result = evaluate(
    dataset=eval_dataset,
    metrics=[
        answer_relevancy,
        faithfulness,
        context_precision,
        context_recall,
    ],
)
print(result)
# {'answer_relevancy': 0.92, 'faithfulness': 0.87, ...}

Key Metrics to Track
- Context Precision — Are the retrieved chunks actually relevant?
- Context Recall — Did we retrieve all the chunks needed to answer?
- Faithfulness — Does the answer stick to what's in the context?
- Answer Relevancy — Does the answer actually address the question?
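For intuition about the two retrieval metrics, here is a simplified, set-based sketch. It assumes you have hand-labeled which chunk IDs are relevant for each question; this is not how RAGAS computes its scores (it relies on LLM judgments and rank-aware formulas), and the function names and chunk IDs are hypothetical. Faithfulness and answer relevancy need a judge model, so they are left out.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for chunk_id in retrieved_ids if chunk_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Fraction of the relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids)) / len(relevant)


retrieved = ["c1", "c4", "c7", "c9"]   # what the retriever returned
relevant = ["c1", "c7", "c8"]          # hand-labeled ground truth

precision = context_precision(retrieved, relevant)
recall = context_recall(retrieved, relevant)
print(f"precision={precision:.2f}, recall={recall:.2f}")
# precision=0.50, recall=0.67
```

Low precision with high recall suggests you are retrieving too many chunks; the reverse suggests the chunks you need are not ranking high enough to make the cut.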
Real Production Numbers
After deploying RAG across three different production systems, here are the results worth sharing:
- Hallucination rate dropped from ~40% to ~5% after switching from pure LLM generation to RAG with faithfulness guardrails.
- Retrieval latency: p50 = 45ms, p99 = 180ms using Pinecone with text-embedding-3-small on a dataset of ~2M chunks.
- Cost savings of ~60% by switching from text-embedding-3-large to text-embedding-3-small with negligible quality loss on our internal benchmarks.
- Monthly embedding spend: ~$120/month for a corpus of 500K documents, re-indexed weekly.
These numbers aren't universal, but they give you a realistic baseline for what to expect.
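If you want to adapt the embedding-spend figure to your own corpus, a back-of-the-envelope estimate is enough. The sketch below rests on assumptions the post does not state: the $0.02 per 1M tokens price from the benchmark table, a full re-embed on every weekly re-index, and a hypothetical average of 2,800 tokens per document, picked only because it lands near the ~$120/month figure.

```python
def monthly_embedding_cost(num_docs, avg_tokens_per_doc,
                           reindexes_per_month, usd_per_1m_tokens):
    """Estimate monthly embedding spend for a corpus that is fully re-embedded."""
    total_tokens = num_docs * avg_tokens_per_doc * reindexes_per_month
    return total_tokens / 1_000_000 * usd_per_1m_tokens


WEEKS_PER_MONTH = 52 / 12  # weekly re-indexing, averaged over a year

cost = monthly_embedding_cost(
    num_docs=500_000,
    avg_tokens_per_doc=2_800,      # hypothetical; measure this on your corpus
    reindexes_per_month=WEEKS_PER_MONTH,
    usd_per_1m_tokens=0.02,        # text-embedding-3-small, from the table
)
print(f"${cost:.2f} per month")
```

The same function makes the case for caching: if only a small share of documents change each week, most of that spend goes to re-embedding text you have already paid for.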
When NOT to Use RAG
RAG is powerful, but it's not a silver bullet. Here are the cases where other approaches actually work better:
- Highly structured, static knowledge — If your data fits in a system prompt (under ~50K tokens), just put it there. RAG adds latency and complexity for no benefit.
- Creative or generative tasks: writing marketing copy, brainstorming, and code generation usually don't benefit from retrieval, because the LLM's parametric knowledge is sufficient.
- When the answer requires deep reasoning over an entire document — RAG retrieves chunks, not whole documents. For tasks like "summarize this 200-page legal contract," consider long-context models (Gemini 2.5 Pro with 1M context) instead.
- When you need deterministic outputs — RAG introduces variability based on which chunks are retrieved. If you need the exact same answer every time, consider a structured lookup or traditional search.
Rule of thumb: If your use case involves answering questions about proprietary, frequently-updated data that doesn't fit in a prompt, RAG is the right choice. For everything else, evaluate simpler alternatives first.
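A quick way to apply the "does it fit in a prompt" test is to estimate the token count of your corpus. This sketch assumes roughly 4 characters per token, a coarse heuristic for English text; for a real decision, count with your model's own tokenizer. The helper names are hypothetical, and the 50,000-token budget mirrors the ~50K figure above.

```python
def estimate_tokens(text, chars_per_token=4):
    """Rough token estimate: character count divided by chars_per_token, rounded up."""
    return -(-len(text) // chars_per_token)


def fits_in_prompt(documents, budget_tokens=50_000, chars_per_token=4):
    """Return (fits, estimated_total_tokens) for a list of document strings."""
    total = sum(estimate_tokens(doc, chars_per_token) for doc in documents)
    return total <= budget_tokens, total


small_corpus = ["a" * 400, "b" * 1_000]
large_corpus = ["x" * 400_000]

print(fits_in_prompt(small_corpus))  # small enough to skip RAG
print(fits_in_prompt(large_corpus))  # too large for the prompt budget
```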
Production Lessons
After deploying several RAG systems, here are the non-obvious lessons:
- Cache embeddings aggressively — Embedding the same document twice is wasted compute and money
- Use hybrid search — Combine dense (vector) and sparse (BM25) retrieval for best results
- Implement fallback strategies — When retrieval confidence is low, say "I don't know" instead of hallucinating
- Monitor drift — Your documents change, your queries change. Re-evaluate monthly.
- Chunk metadata matters — Include source, section headers, and dates in your chunks for better filtering
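For the hybrid search lesson, one common way to combine a dense ranking and a BM25 ranking is Reciprocal Rank Fusion (RRF). The post does not say which fusion method it uses, so treat this as one option rather than the method behind the numbers above. The chunk IDs are hypothetical, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into one list.

    Each ID scores 1 / (k + rank) in every list it appears in; the scores are
    summed, and IDs are returned from highest to lowest total score.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by score descending; break ties by ID for a stable result
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_ranking = ["c3", "c1", "c7"]    # from vector search
sparse_ranking = ["c1", "c9", "c3"]   # from BM25

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
# ['c1', 'c3', 'c9', 'c7']
```

RRF uses only rank positions, so you do not have to normalize cosine similarities against BM25 scores, which live on very different scales.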
What's Next
In my next post, I'll cover advanced retrieval patterns — query decomposition, HyDE, and multi-hop retrieval for complex questions. If you're interested in how I structure my Python backends or how AI agents are changing library ecosystems, check those out too.
Want to discuss RAG architecture or collaborate on an AI project? Get in touch — I'm always happy to talk shop.