The Problem: LLMs Don't Know Your Data
LLMs are trained on public internet data. They don't know:
- Your company policies
- Your internal documentation
- Your custom data formats
- Recent updates (knowledge cutoff)
So they hallucinate: when the answer isn't in their training data, they make something up. A user asks, "What's our refund policy?" and the LLM invents one.
Solution: RAG (Retrieval-Augmented Generation). Retrieve relevant documents from YOUR knowledge base, then have the LLM answer based on that.
What Is RAG (Really)?
RAG = Retrieval + Generation.
- Retrieval: Find documents relevant to the user's query
- Generation: LLM reads those documents and answers the query
Simple example:
Retrieval: Search knowledge base → Find "refund_policy.md"
Generation: LLM reads document + generates answer
LLM response: "You have 30 days for a full refund..."
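The two steps can be sketched end to end in a few lines. This is a toy illustration, not a production design: the knowledge base is a hypothetical in-memory dict, retrieval is plain word overlap instead of embeddings, and `llm` is any callable you supply that maps a prompt string to an answer string.

```python
import re
from typing import Callable

# Hypothetical in-memory knowledge base: file name -> document text.
KNOWLEDGE_BASE = {
    "refund_policy.md": "You have 30 days for a full refund.",
    "shipping.md": "Orders ship within 2 business days.",
}

def words(text: str) -> set[str]:
    """Lowercase the text and split it into a set of simple word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def retrieve(query: str, kb: dict[str, str]) -> tuple[str, str]:
    """Retrieval: return the (name, text) pair sharing the most words with the query."""
    query_words = words(query)
    return max(kb.items(), key=lambda item: len(query_words & words(item[0] + " " + item[1])))

def answer(query: str, kb: dict[str, str], llm: Callable[[str], str]) -> str:
    """Generation: hand the retrieved document plus the question to the LLM."""
    name, text = retrieve(query, kb)
    prompt = f"Answer using only this document ({name}):\n{text}\n\nQuestion: {query}"
    return llm(prompt)
```

Swapping the word-overlap retriever for embedding search changes only `retrieve`; the overall flow stays the same.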
RAG Architecture (4 Components)
1. Knowledge Base (Documents)
Your data. Could be:
- PDFs, Word docs
- Internal wikis, Notion pages
- Database records
- Customer interactions
2. Embeddings (Vector Representations)
Convert documents into vectors (numbers). Semantic similarity = close vectors.
"Returns and refunds process" → Embedding 2
These are similar → Will match
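"Close" is usually measured with cosine similarity. Here is a minimal sketch; the three-dimensional vectors are made-up numbers for illustration only (real embedding models return hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: near 1.0 = similar, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors, NOT output from a real embedding model.
query_vec = [0.9, 0.1, 0.0]         # a question about refunds
refund_doc_vec = [0.8, 0.2, 0.1]    # a document about returns and refunds
shipping_doc_vec = [0.1, 0.1, 0.9]  # an unrelated document

refund_score = cosine_similarity(query_vec, refund_doc_vec)
shipping_score = cosine_similarity(query_vec, shipping_doc_vec)
```

With these numbers the refund document scores far higher than the shipping document, so it is the one that matches.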
3. Vector Database (Search Index)
Store embeddings. When a user queries → Search for similar embeddings → Retrieve documents.
Examples: Pinecone, Weaviate, Milvus
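To make the idea concrete, here is a brute-force, in-memory stand-in for a vector database. The class name and methods are hypothetical and are not the API of Pinecone, Weaviate, or Milvus; hosted products typically use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import heapq
import math

def _normalize(vector: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]

class InMemoryVectorIndex:
    """Toy index: stores unit vectors and scans all of them on every search."""

    def __init__(self) -> None:
        self._items: list[tuple[str, list[float]]] = []

    def add(self, doc_id: str, vector: list[float]) -> None:
        self._items.append((doc_id, _normalize(vector)))

    def search(self, query_vector: list[float], k: int = 3) -> list[tuple[str, float]]:
        """Return the k most similar (doc_id, score) pairs, best first."""
        q = _normalize(query_vector)
        scored = ((sum(a * b for a, b in zip(q, v)), doc_id) for doc_id, v in self._items)
        return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]
```

A full scan is fine for a few thousand chunks; beyond that, a dedicated vector database earns its keep.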
4. LLM (Generation)
Read retrieved documents + original query → Generate answer.
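In practice, "read retrieved documents + original query" means assembling one prompt. The sketch below is one possible layout, not a standard: the wording of the instruction, the `[n]` source labels, and the `max_chars` budget are all assumptions to adapt to your model.

```python
def build_prompt(query: str, documents: list[tuple[str, str]], max_chars: int = 4000) -> str:
    """Combine retrieved (source, text) pairs and the user's query into one prompt.

    Documents are added in order until the character budget is used up.
    """
    parts = []
    used = 0
    for i, (source, text) in enumerate(documents, start=1):
        block = f"[{i}] {source}\n{text}"
        if used + len(block) > max_chars:
            break  # stay inside the (assumed) context budget
        parts.append(block)
        used += len(block)
    context = "\n\n".join(parts)
    return (
        "Answer the question using only the sources below. "
        "If the answer is not in the sources, say you don't know.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

Telling the model to admit when the sources don't contain the answer is what keeps it from falling back to made-up policies.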
Indexing Strategy (How to Chunk)
You can't index a 1000-page PDF as a single document. It's too large to embed meaningfully or to fit in a prompt, so you need to split it into chunks.
Strategy 1: Fixed-Size Chunks
Split into 512-token chunks. Simple, but a boundary can fall mid-sentence or mid-idea, so chunks lose context.
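A minimal sketch of fixed-size chunking follows. Two assumptions to note: "tokens" are approximated by whitespace-separated words (a real tokenizer will count differently), and the `overlap` parameter is an addition that repeats a few tokens between neighboring chunks to soften the context loss.

```python
def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of at most chunk_size words, sharing `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # crude stand-in for a real tokenizer
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks
```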
Strategy 2: Semantic Chunks
Split by topic/paragraph. Preserves meaning.
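The simplest version of this splits on blank lines and packs neighboring paragraphs together up to a size limit. This is only a rough sketch: it treats paragraph breaks as topic boundaries, and `max_words` is an arbitrary budget. True topic detection would need more than this.

```python
import re

def chunk_by_paragraph(text: str, max_words: int = 200) -> list[str]:
    """Group whole paragraphs into chunks without ever splitting a paragraph."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for paragraph in paragraphs:
        n = len(paragraph.split())
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))  # close the chunk at a paragraph boundary
            current, count = [], 0
        current.append(paragraph)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than `max_words` still becomes its own oversized chunk here; you could fall back to fixed-size splitting for those.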
Strategy 3: Hierarchical Chunks
Chunks within chunks. Document > Sections > Paragraphs. Retrieve at appropriate level.
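One way to model this is a tree where every node knows its parent. You match on a small paragraph, then walk up to hand the LLM the surrounding section. The `Node` class and `expand` helper below are hypothetical names for illustration.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    """One level of the hierarchy: a document, a section, or a paragraph."""
    title: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)
    parent: "Node | None" = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def full_text(self) -> str:
        """This node's own text followed by the text of everything beneath it."""
        parts = [self.text] if self.text else []
        parts += [c.full_text() for c in self.children]
        return "\n".join(p for p in parts if p)

def expand(node: Node, levels: int = 1) -> Node:
    """Walk up `levels` steps (stopping at the root) to get wider context."""
    for _ in range(levels):
        if node.parent is None:
            break
        node = node.parent
    return node
```

Usage: embed and search the paragraph nodes, then pass `expand(hit).full_text()` to the LLM when a single paragraph is too thin to answer from.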
| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| Fixed-Size | ⚡⚡⚡ | ⭐⭐ | Quick POCs |
| Semantic | ⚡⚡ | ⭐⭐⭐ | Production |
| Hierarchical | ⚡ | ⭐⭐⭐⭐ | Enterprise |
Retrieval Optimization
Problem: Irrelevant Results
Sometimes search returns irrelevant documents, and the LLM tries to answer from the wrong docs.
Solution 1: Hybrid Search
Combine semantic search + keyword search.
- Semantic: Find conceptually similar docs
- Keyword: Find docs with exact terms
- Combine: Get best of both
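For the "combine" step, one common option is reciprocal rank fusion (RRF), which merges ranked lists without needing their scores to be comparable. This is one technique among several, not the only way to do hybrid search; `k=60` is a conventional default, and the document IDs in the usage note are placeholders.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc IDs: each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Usage: `reciprocal_rank_fusion([semantic_ids, keyword_ids])`, where each argument is the ordered list of IDs returned by one search method. Documents that rank well in both lists rise to the top.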
Solution 2: Reranking
Retrieve 50 documents. Rerank by relevance. Use only top 5 for LLM.
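The retrieve-wide, keep-few pattern looks like this. The scoring function is pluggable: production systems often use a cross-encoder model here, while `term_overlap` below is just a toy stand-in so the sketch runs without any model.

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score every candidate against the query and keep the best top_n."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def term_overlap(query: str, doc: str) -> float:
    """Toy scorer: fraction of query words that also appear in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0
```

The first-stage search can afford to be fast and sloppy because the reranker only has to look at a few dozen candidates.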
Solution 3: Query Expansion
User asks: "Refund process?"
Expand to: "Returns, refunds, money-back guarantee, reimbursement"
Search for all variations
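A bare-bones version of query expansion uses a synonym table. The `SYNONYMS` dict here is a hypothetical hand-written example; in practice the variations might come from a thesaurus or from asking an LLM to rephrase the query.

```python
# Hypothetical synonym table; build your own from domain vocabulary.
SYNONYMS = {
    "refund": ["return", "money-back guarantee", "reimbursement"],
}

def expand_query(query: str, synonyms: dict[str, list[str]] = SYNONYMS) -> list[str]:
    """Return the original query plus one variant per known synonym."""
    variants = [query]
    lowered = query.lower()
    for term, alternatives in synonyms.items():
        if term in lowered:
            variants.extend(lowered.replace(term, alt) for alt in alternatives)
    # Drop duplicates while keeping the original order.
    return list(dict.fromkeys(variants))
```

Run the search once per variant, then merge and deduplicate the results before passing them on.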
Scaling RAG to Enterprise
Consideration 1: Document Updates
Policies change. How do you update embeddings?
- Real-time: Update on publish
- Batch: Daily/weekly refresh
- Hybrid: Hot documents real-time, others batch
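Whichever schedule you pick, you only want to re-embed what actually changed. A simple way to detect that, sketched below, is to store a content hash next to each embedding. The function names and the shape of `indexed_hashes` are assumptions for illustration.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(documents: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current documents against stored hashes.

    Returns (ids to embed again, ids to delete from the index).
    """
    to_embed = [
        doc_id for doc_id, text in documents.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return sorted(to_embed), sorted(to_delete)
```

The same check works for both modes: call it from a publish hook for real-time updates, or from a nightly job for batch refreshes.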
Consideration 2: Latency
Retrieval + Generation takes time:
- Search vector DB: 100-500ms
- LLM generation: 1-5 seconds
- Total: roughly 1-6 seconds (acceptable for chat)
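These numbers vary a lot by setup, so it's worth measuring your own. Below is a small timing helper built on the standard library; the stage names in the usage note are placeholders for whatever your pipeline calls.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict[str, float]):
    """Add the wall-clock seconds spent inside the `with` block to timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = timings.get(stage, 0.0) + (time.perf_counter() - start)
```

Usage: create `timings = {}`, wrap each step in `with timed("search", timings):` or `with timed("generation", timings):`, then log the dict per request to see which stage dominates.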
Consideration 3: Cost
Each query involves:
- Embeddings API call (for query)
- Vector DB search
- LLM generation
Cost per query: roughly $0.005-0.01 at scale, depending on the model and how much retrieved text goes into the prompt
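You can estimate your own figure with simple arithmetic. Every price below is a made-up placeholder, not any vendor's actual rate; plug in the numbers from your providers' pricing pages.

```python
def cost_per_query(query_tokens: int, prompt_tokens: int, output_tokens: int,
                   embed_price_per_1k: float, input_price_per_1k: float,
                   output_price_per_1k: float, search_price: float) -> float:
    """Sum the three cost components of one RAG query, in dollars."""
    embedding = query_tokens / 1000 * embed_price_per_1k
    generation = (prompt_tokens / 1000 * input_price_per_1k
                  + output_tokens / 1000 * output_price_per_1k)
    return embedding + search_price + generation

# Placeholder prices for illustration only.
example = cost_per_query(
    query_tokens=20, prompt_tokens=2000, output_tokens=300,
    embed_price_per_1k=0.0001, input_price_per_1k=0.002,
    output_price_per_1k=0.006, search_price=0.0001,
)
```

With these placeholder inputs, generation accounts for almost all of the total, which is why trimming the retrieved context is usually the first cost lever to pull.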
Key Takeaways
✓ Use semantic chunking for accuracy
✓ Implement hybrid search (semantic + keyword)
✓ Add reranking for relevance
✓ Plan for document updates
✓ Monitor latency and costs