Building Your First RAG System: Practical Guide with LangChain and Pinecone
Retrieval-Augmented Generation (RAG) has become the standard architecture for building LLM applications with private data. In this intermediate tutorial, we'll build a production-ready RAG system from scratch using LangChain and Pinecone, with performance optimizations and real deployment considerations.
Why RAG? The Knowledge Gap Problem
Large Language Models have impressive parametric knowledge (trained on public data up to a cutoff date), but they lack:
- Private data (company documents, internal wikis)
- Recent information (news, latest research)
- Domain-specific expertise (medical journals, legal precedents)
RAG solves this by retrieving relevant documents and providing them as context to the LLM.
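To make the idea concrete before we bring in any libraries, here is a minimal sketch of the retrieve-then-prompt loop. It assumes a toy word-overlap scorer in place of real embeddings, and the helper names (`tokenize`, `retrieve`, `build_prompt`) and sample documents are invented for illustration only.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    """Rank documents by shared query words (toy stand-in for vector search)."""
    q = tokenize(query)
    ranked = sorted(documents, key=lambda d: len(q & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context_docs: list[str]) -> str:
    """Place the retrieved text ahead of the question so the LLM can ground its answer."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical private documents the base model has never seen
docs = [
    "The vacation policy grants 25 days of paid leave per year.",
    "Expense reports are due on the fifth of each month.",
    "The office is closed on public holidays.",
]
question = "How many days of paid leave does the vacation policy grant?"
prompt = build_prompt(question, retrieve(question, docs, k=1))
print(prompt)
```

The rest of this guide replaces each toy piece with a real component: embeddings instead of word overlap, Pinecone instead of a Python list, and a chat model that consumes the prompt.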
Architecture Overview
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   User Query    │────▶│ Document Store  │────▶│   Embeddings    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Query Embed   │────▶│ Vector Database │────▶│   Top-K Docs    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                               │
         ▼                                               ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  LLM + Context  │◀────│  Prompt Engine  │◀────│ Retrieved Docs  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
└─────────────────┘
Prerequisites
# Install required packages
pip install langchain langchain-community langchain-openai langchain-pinecone pinecone-client openai
pip install sentence-transformers pypdf python-dotenv

# For production monitoring
pip install langsmith ragas
Step 1: Document Processing Pipeline
Document Loading and Chunking Strategies
import os

from langchain_community.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load documents from directory
def load_and_chunk_documents(directory_path: str):
    """Load every PDF under directory_path and split it into overlapping chunks."""
    loader = DirectoryLoader(
        directory_path,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True
    )

    documents = loader.load()

    # Recursive chunking with overlap: tries paragraph breaks first, then finer separators
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # Characters per chunk; a common starting point, tune for your data
        chunk_overlap=200,  # Preserve context across chunk boundaries
        length_function=len,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""]
    )

    chunks = text_splitter.split_documents(documents)

    print(f"Loaded {len(documents)} documents")
    print(f"Created {len(chunks)} chunks with 200-character overlap")

    return chunks

# Add metadata for better retrieval
def enhance_chunk_metadata(chunks):
    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "chunk_id": i,
            "chunk_length": len(chunk.page_content),
            # File extension of the source document, e.g. ".pdf"
            "source_type": os.path.splitext(chunk.metadata.get("source", ""))[1]
        })
    return chunks
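To see how `chunk_size` and `chunk_overlap` interact, the following sketch splits a string with a plain sliding window. This is a simplification for illustration, not the algorithm `RecursiveCharacterTextSplitter` uses (which also respects separators); the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Fixed-size windows; each window starts chunk_size - chunk_overlap after the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, which is the same effect the 200-character overlap has on the 1000-character chunks above: a sentence cut at a boundary still appears whole in at least one chunk.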
Step 2: Vector Database Setup with Pinecone
Production-Ready Index Configuration
import os
import time

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

load_dotenv()

# Initialize Pinecone with production settings
def initialize_pinecone_index():
    # ServerlessSpec belongs to the client-object API, so create a Pinecone client
    # instead of calling the older module-level pinecone.init()
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

    index_name = "aicloudinsider-rag-index"

    # Check if index exists
    if index_name not in pc.list_indexes().names():
        print(f"Creating index: {index_name}")

        # Production index spec
        pc.create_index(
            name=index_name,
            dimension=3072,  # Must match the embedding model (text-embedding-3-large)
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region="us-west-2"
            )
        )

        # Wait for index to be ready
        while not pc.describe_index(index_name).status["ready"]:
            time.sleep(1)

    return index_name

# Create embeddings and upload
def create_and_store_embeddings(chunks, index_name):
    # Use OpenAI embeddings (or open-source alternatives)
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-large",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

    # Alternative: Open-source embeddings
    # (the index dimension must then match that model's output size)
    # from langchain_community.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(
    #     model_name="thenlper/gte-large"
    # )

    # Create vector store (reads PINECONE_API_KEY from the environment)
    vector_store = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name
    )

    print(f"Stored {len(chunks)} embeddings in Pinecone")
    return vector_store
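Two settings in the index configuration deserve a closer look: `metric="cosine"` and `dimension`. The sketch below shows, in plain Python, what a cosine top-k lookup computes conceptually and why every stored vector must have the same length as the query vector. It is a brute-force illustration with made-up three-dimensional vectors, not how Pinecone implements search internally.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], stored: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical 3-dimensional vectors; real embeddings have hundreds or thousands of dimensions
stored_vectors = {
    "doc-a": [1.0, 0.0, 0.0],
    "doc-b": [0.6, 0.8, 0.0],
    "doc-c": [0.0, 0.0, 1.0],
}
matches = top_k([1.0, 0.0, 0.0], stored_vectors, k=2)
print(matches)
```

The `ValueError` branch mirrors what happens in practice: if the index dimension differs from the length of the vectors your embedding model returns, upserts and queries are rejected, so check the model's output size before creating the index.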
Step 3: Retrieval Strategies and Optimization
Hybrid Search: Combining Dense and Sparse Retrieval
import os
from collections import Counter

from pinecone import Pinecone

class HybridRetriever:
    def __init__(self, vector_store, sparse_weight=0.3):
        self.vector_store = vector_store
        self.sparse_weight = sparse_weight
        self.pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

    def dense_search(self, query_embedding, k=10):
        """Dense vector similarity search"""
        # Search by vector: similarity_search_with_score expects the raw query text instead
        results = self.vector_store.similarity_search_by_vector_with_score(
            query_embedding,
            k=k
        )
        return results

    def sparse_search(self, query, k=10):
        """Sparse keyword-based search (simplified placeholder)"""
        # In production, use BM25 or similar
        keywords = query.lower().split()

        # This is simplified - use proper sparse embeddings in production
        keyword_counts = Counter(keywords)

        # Return documents with keyword matches
        # Actual implementation would use a proper sparse index
        return []

    def hybrid_retrieve(self, query, query_embedding, k=5):
        """Combine dense and sparse results"""
        # Retrieve more candidates than needed so rescoring has room to reorder them
        dense_results = self.dense_search(query_embedding, k=k*2)
        sparse_results = self.sparse_search(query, k=k*2)  # Placeholder, currently empty

        # Score fusion
        combined_results = []

        for doc, score in dense_results:
            dense_score = score
            sparse_score = self._calculate_sparse_score(doc.page_content, query)

            # Weighted combination
            final_score = (1 - self.sparse_weight) * dense_score + self.sparse_weight * sparse_score

            combined_results.append((doc, final_score))

        # Sort by combined score and return top-k
        combined_results.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, score in combined_results[:k]]

    def _calculate_sparse_score(self, document, query):
        """Simple keyword matching score: fraction of query words found in the document"""
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())

        intersection = query_words.intersection(doc_words)
        return len(intersection) / len(query_words) if query_words else 0
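The `sparse_search` placeholder above says to use BM25 in production. As a rough idea of what that scoring looks like, here is a from-scratch sketch of Okapi BM25 over a tiny in-memory corpus. It assumes common default parameters (`k1=1.5`, `b=0.75`), counts each distinct query term once, and uses invented sample sentences; for real workloads a maintained BM25 library or Pinecone's sparse vectors would be the safer choice.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query: str, documents: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs if n_docs else 0.0
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates with k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(toks) / avg_len))
        scores.append(score)
    return scores

corpus = [
    "Pinecone stores dense vectors for similarity search",
    "BM25 ranks documents by keyword overlap",
    "Keyword search complements dense retrieval",
]
scores = bm25_scores("keyword search", corpus)
print(scores)
```

Unlike the set-overlap score in `_calculate_sparse_score`, BM25 rewards rare terms and penalizes long documents, which is why it tends to be a stronger sparse signal to fuse with dense similarity.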
Step 4: LLM Integration with Advanced Prompt Engineering
Context-Aware Prompt Template
from langchain.chains import RetrievalQA
from langchain.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

def create_rag_chain(vector_store, llm_model="gpt-4-turbo"):
    # System prompt with instructions; {context} is filled with the retrieved chunks
    system_prompt = """You are an expert AI assistant answering questions based on provided context.

    Context Rules:
    1. ONLY use information from the provided context
    2. If the context doesn't contain the answer, say "I cannot answer based on provided context"
    3. Cite specific sources from context when possible
    4. Maintain technical accuracy

    Context:
    {context}"""

    # The question goes in the human message, so it is not repeated in the system prompt
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{question}")
    ])

    # LLM with conservative parameters
    llm = ChatOpenAI(
        model=llm_model,
        temperature=0.1,  # Low for factual accuracy
        max_tokens=1000,
        timeout=30,
        max_retries=3
    )

    # Create chain with retrieval
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 5}
        ),
        chain_type_kwargs={"prompt": prompt},  # Use the custom prompt instead of the default
        return_source_documents=True
    )

    return qa_chain
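Rule 3 in the system prompt asks the model to cite sources, which only works if the context it receives carries source labels. One way to do that, sketched here under the assumption that each chunk has `source` and optionally `page` metadata (as `PyPDFLoader` typically provides), is to number the chunks and prefix each with its origin. The `Doc` class is a minimal stand-in for a LangChain `Document`, and `format_context` is a hypothetical helper.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Minimal stand-in for a LangChain Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_context(docs: list[Doc]) -> str:
    """Number each chunk and attach its source so the model can cite it."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        label = f"{source}, page {page}" if page is not None else source
        blocks.append(f"[{i}] ({label})\n{doc.page_content.strip()}")
    return "\n\n".join(blocks)

# Hypothetical retrieved chunks
retrieved = [
    Doc("Chunk overlap preserves context across boundaries.", {"source": "guide.pdf", "page": 3}),
    Doc("Cosine similarity compares embedding directions.", {"source": "notes.pdf"}),
]
context = format_context(retrieved)
print(context)
```

With numbered, labeled chunks in `{context}`, an answer such as "see [1]" can be traced back to a specific file and page during review.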
Step 5: Production Deployment Considerations
Monitoring and Evaluation Setup
import os

import langsmith
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness

def setup_monitoring():
    # LangSmith tracing is switched on through environment variables
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY", "")
    os.environ["LANGCHAIN_PROJECT"] = "aicloudinsider-rag"

    # Create evaluation dataset
    evaluation_questions = [
        "What is transformer architecture?",
        "How does attention mechanism work?",
        "Explain RAG architecture components"
    ]

    return evaluation_questions

def evaluate_rag_performance(qa_chain, questions):
    """RAG evaluation with metrics that do not need reference answers"""
    results = []

    for question in questions:
        # Trace the query
        with langsmith.trace("rag-evaluation"):
            response = qa_chain.invoke({"query": question})

        # ragas expects a datasets.Dataset; these column names follow the
        # ragas 0.1-style schema, so check them against your installed version
        dataset = Dataset.from_dict({
            "question": [question],
            "answer": [response["result"]],
            "contexts": [[doc.page_content for doc in response["source_documents"]]],
        })

        # context_recall and context_precision require a real ground_truth
        # column, so add them only once you have reference answers
        metrics = evaluate(dataset, metrics=[faithfulness, answer_relevancy])

        results.append({
            "question": question,
            "answer": response["result"],
            "faithfulness": metrics["faithfulness"],
            "relevancy": metrics["answer_relevancy"],
            "sources": len(response["source_documents"])
        })

    return results
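Once `evaluate_rag_performance` returns its list of per-question results, you will usually want a single pass/fail signal. The sketch below aggregates one metric and flags questions below a threshold; the 0.85 default echoes the faithfulness target in the deployment checklist later in this guide. The function name `summarize_results` is hypothetical, and the sample scores are illustrative numbers, not measurements.

```python
from statistics import mean

def summarize_results(results: list[dict], metric: str = "faithfulness", threshold: float = 0.85) -> dict:
    """Average one metric and list the questions that fall below the threshold."""
    scores = [r[metric] for r in results]
    failing = [r["question"] for r in results if r[metric] < threshold]
    return {
        "mean": mean(scores) if scores else None,
        "failing_questions": failing,
        "passed": bool(scores) and not failing,
    }

# Illustrative results in the same shape evaluate_rag_performance returns
sample_results = [
    {"question": "What is transformer architecture?", "faithfulness": 1.0},
    {"question": "How does attention mechanism work?", "faithfulness": 0.75},
]
summary = summarize_results(sample_results)
print(summary)
```

A summary like this is easy to wire into CI so that a drop in faithfulness blocks a deployment instead of surfacing later in user reports.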
Step 6: Performance Optimization Techniques
Caching, Batching, and Query Optimization
import asyncio

class OptimizedRAGSystem:
    def __init__(self, vector_store, llm):
        self.vector_store = vector_store
        self.llm = llm
        self.embedding_cache = {}
        self.query_cache = {}

    def get_embeddings(self, text):
        """Cache embeddings for common queries"""
        # A plain dict is enough here; stacking lru_cache on the method would
        # duplicate the cache and keep the instance alive
        if text in self.embedding_cache:
            return self.embedding_cache[text]

        # Calculate embedding with the store's embedding model
        embedding = self.vector_store.embeddings.embed_query(text)
        self.embedding_cache[text] = embedding
        return embedding

    async def batch_retrieve(self, queries):
        """Batch processing for multiple queries"""
        # Get embeddings for all queries
        embeddings = [self.get_embeddings(q) for q in queries]

        # Run the synchronous vector searches concurrently in worker threads
        results = await asyncio.gather(*[
            asyncio.to_thread(
                self.vector_store.similarity_search_by_vector_with_score, emb, k=5
            )
            for emb in embeddings
        ])

        return results

    def query_decomposition(self, complex_query):
        """Break complex queries into simpler sub-queries"""
        # Use LLM to decompose query
        decomposition_prompt = f"""
        Break this complex query into simpler sub-queries:

        Complex query: {complex_query}

        Return as JSON list of sub-queries.
        """

        # A real implementation would send decomposition_prompt to self.llm
        # and parse the JSON reply; a fixed list is returned for this example
        return [
            "What is transformer architecture?",
            "How does attention work?",
            "What are RAG components?"
        ]
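The dictionary cache above grows without bound. If that is a concern, a small LRU cache keyed by normalized query text is one option; the sketch below assumes that lowercasing and collapsing whitespace does not change the meaning of your queries, which may not hold for every embedding model. `EmbeddingCache` and `fake_embed` are hypothetical names used only for this demo.

```python
from collections import OrderedDict
from typing import Callable

class EmbeddingCache:
    """Small LRU cache keyed by normalized query text."""

    def __init__(self, embed_fn: Callable[[str], list[float]], max_size: int = 1000):
        self.embed_fn = embed_fn
        self.max_size = max_size
        self._store = OrderedDict()
        self.misses = 0

    def get(self, text: str) -> list[float]:
        key = " ".join(text.lower().split())  # normalize case and whitespace
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return self._store[key]
        self.misses += 1
        vector = self.embed_fn(key)
        self._store[key] = vector
        if len(self._store) > self.max_size:
            self._store.popitem(last=False)  # evict the least recently used entry
        return vector

def fake_embed(text: str) -> list[float]:
    """Hypothetical embedder used only for this demo."""
    return [float(len(text)), float(text.count(" "))]

cache = EmbeddingCache(fake_embed, max_size=2)
cache.get("What is RAG?")
cache.get("what   is rag?")     # same key after normalization: cache hit
cache.get("Explain chunking")
cache.get("Explain reranking")  # third distinct key evicts "what is rag?"
print(cache.misses)
```

In a multi-process deployment the same interface could sit in front of a shared store such as Redis, as the checklist below suggests, so that all workers benefit from each other's lookups.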
Production Deployment Checklist
✅ Infrastructure:
- Pinecone index with proper sizing
- OpenAI API keys with rate limits
- LangSmith for monitoring
- Backup vector database (pgvector for redundancy)
✅ Performance:
- Embedding caching layer (Redis)
- Query batching for high throughput
- Async processing for concurrent requests
- Load testing with realistic queries
✅ Quality:
- RAG metrics tracking (faithfulness > 0.85)
- Human evaluation pipeline
- A/B testing for retrieval strategies
- Regular data refresh (weekly/monthly)
✅ Security:
- API key rotation (quarterly)
- Input validation and sanitization
- Rate limiting per user
- Audit logging for sensitive queries
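As one example from the security list, "rate limiting per user" can be sketched with a token bucket: each user may burst up to `capacity` requests, and tokens refill at a steady rate. This is an in-memory illustration under the assumption of a single process; the class name and parameters are hypothetical, and a production system would more likely enforce limits at an API gateway or in a shared store.

```python
import time
from typing import Callable

class RateLimiter:
    """Token bucket per user: bursts up to `capacity`, refilled at `refill_per_sec`."""

    def __init__(self, capacity: int, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self._buckets = {}  # user_id -> (tokens, time of last request)

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never beyond capacity
        tokens = min(float(self.capacity), tokens + (now - last) * self.refill_per_sec)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[user_id] = (tokens, now)
        return allowed

# A fake clock keeps the demo deterministic
fake_now = [0.0]
limiter = RateLimiter(capacity=2, refill_per_sec=1.0, clock=lambda: fake_now[0])
decisions = [limiter.allow("alice") for _ in range(3)]  # third call exceeds the burst
fake_now[0] = 1.0  # one second later, one token has been refilled
decisions.append(limiter.allow("alice"))
print(decisions)
# [True, True, False, True]
```

Calling `limiter.allow(user_id)` before each RAG query lets you reject or queue excess requests before they reach the embedding and LLM APIs, which also protects your own upstream rate limits.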
Common Pitfalls and Solutions
- Chunking Issues: Chunks that are too small lose context; chunks that are too large reduce precision
  - Solution: Dynamic chunking based on document structure
- Retrieval Quality: Returning irrelevant documents
  - Solution: Hybrid search + reranking with cross-encoders
- Context Window Limits: Too many retrieved documents exceed the LLM context window
  - Solution: Smart truncation + summarization chains
- LLM Hallucination: Generating facts not in context
  - Solution: Strong system prompt + answer verification step
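For the context window pitfall, the simplest form of truncation is to keep the highest-ranked chunks until a budget is used up. The sketch below counts characters as a rough proxy for tokens, which is an assumption; a real implementation would count tokens with the model's tokenizer. `fit_to_budget` is a hypothetical helper name.

```python
def fit_to_budget(chunks: list[str], max_chars: int, separator: str = "\n\n") -> list[str]:
    """Keep chunks in ranked order until the character budget is used up."""
    selected = []
    used = 0
    for chunk in chunks:
        # Every chunk after the first also pays for the separator that joins it
        cost = len(chunk) + (len(separator) if selected else 0)
        if used + cost > max_chars:
            break
        selected.append(chunk)
        used += cost
    return selected

# Three ranked chunks of 40 characters each, with a 100-character budget
ranked_chunks = ["a" * 40, "b" * 40, "c" * 40]
kept = fit_to_budget(ranked_chunks, max_chars=100)
print(len(kept), len("\n\n".join(kept)))
```

Because the chunks arrive in relevance order, whatever is dropped is the least relevant material; summarizing the dropped chunks instead of discarding them is the natural next step when recall matters.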
Advanced RAG Patterns
1. Multi-hop Reasoning
Break complex questions into sequential retrievals:
# Example: "What was the impact of transformer architecture on BERT's performance?"
# Step 1: Retrieve about transformers
# Step 2: Retrieve about BERT
# Step 3: Synthesize answer
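The retrieval part of those steps might look like the sketch below: one retrieval per sub-query, merged in order without duplicates, ready to hand to the LLM for synthesis. The in-memory `toy_index` and `toy_retrieve` are hypothetical stand-ins for the vector store, and the sub-queries are written by hand here, whereas a real system would generate them (for example with `query_decomposition`).

```python
from typing import Callable

def multi_hop_retrieve(sub_queries: list[str], retrieve: Callable[[str], list[str]]) -> list[str]:
    """Run one retrieval per sub-query, in order, and merge results without duplicates."""
    seen = set()
    merged = []
    for sub_query in sub_queries:
        for passage in retrieve(sub_query):
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged

# Hypothetical in-memory "index" standing in for the vector store
toy_index = {
    "transformer": ["Transformers rely on self-attention.", "BERT is a transformer encoder."],
    "bert": ["BERT is a transformer encoder.", "BERT is pretrained with masked language modeling."],
}

def toy_retrieve(query: str) -> list[str]:
    """Return passages whose topic key appears in the query."""
    return [p for key, passages in toy_index.items() if key in query.lower() for p in passages]

evidence = multi_hop_retrieve(
    ["What is the transformer architecture?", "What is BERT?"],
    toy_retrieve,
)
print(evidence)
```

Deduplicating across hops matters because related sub-queries often surface the same chunk, and repeating it wastes context window space.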
2. Query Expansion
Generate multiple query variations:
# Original: "AI model training"
# Expanded: ["machine learning training", "neural network training", "LLM fine-tuning"]
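A very simple way to produce such variations is phrase substitution from a synonym table, sketched below. The table contents and the `expand_query` helper are hypothetical; in practice the variations are often generated by an LLM, which can also produce rewrites like "LLM fine-tuning" that no substitution table would find.

```python
def expand_query(query: str, synonyms: dict[str, list[str]]) -> list[str]:
    """Return the original query plus one variant per matching synonym."""
    variants = [query]
    for phrase, alternatives in synonyms.items():
        if phrase in query:
            variants.extend(query.replace(phrase, alt) for alt in alternatives)
    # dict.fromkeys removes duplicates while keeping the original order
    return list(dict.fromkeys(variants))

# Hypothetical synonym table
synonym_table = {"AI model": ["machine learning", "neural network"]}
expanded = expand_query("AI model training", synonym_table)
print(expanded)
# ['AI model training', 'machine learning training', 'neural network training']
```

Each variant is then sent through retrieval, and the result lists are merged, typically with deduplication or a rank-fusion step.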
3. Hybrid Search with Reranking
Dense retrieval → sparse retrieval → cross-encoder reranking
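Before the reranking stage, the dense and sparse result lists have to be merged into one candidate list. A common technique for that, shown here as an alternative to the weighted score sum used in `HybridRetriever`, is reciprocal rank fusion (RRF), which only needs ranks and so avoids comparing scores on different scales. The constant `k=60` is a conventional default, and the document IDs are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical result lists from the two retrievers, best match first
dense_ranking = ["doc-2", "doc-1", "doc-3"]
sparse_ranking = ["doc-1", "doc-4", "doc-2"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
# ['doc-1', 'doc-2', 'doc-4', 'doc-3']
```

Documents that appear near the top of both lists rise to the front; the fused list is what you would pass to a cross-encoder for the final, more expensive reranking.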
Cost Optimization
| Component | Cost Driver | Optimization Strategy |
|---|---|---|
| Embeddings | API calls | Batch requests, cache common queries |
| Vector DB | Storage/query | Tiered storage, query optimization |
| LLM | Token usage | Context compression, response caching |
| Infrastructure | Compute | Auto-scaling, spot instances |
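The "batch requests" strategy in the first row comes down to grouping texts so that one API call embeds many of them. A minimal sketch of the grouping step follows; the `batched` helper and the batch size are illustrative assumptions, and the right size depends on your provider's request limits.

```python
from typing import Iterator

def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

texts = ["chunk-1", "chunk-2", "chunk-3", "chunk-4", "chunk-5"]
batches = list(batched(texts, batch_size=2))
print(batches)
# [['chunk-1', 'chunk-2'], ['chunk-3', 'chunk-4'], ['chunk-5']]
```

Each batch would then go to a single embedding call (for example `embed_documents` on a LangChain embeddings object), turning five requests into three in this toy case.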
Next Steps: From Prototype to Production
- Implement the basic RAG pipeline above
- Add monitoring with LangSmith and RAGAS
- Optimize retrieval with hybrid search
- Scale with caching and batching
- Deploy with proper CI/CD and rollback strategies
Conclusion
Building a production RAG system involves more than just connecting an LLM to a vector database. By implementing proper chunking strategies, hybrid retrieval, comprehensive monitoring, and performance optimizations, you can create systems that deliver accurate, reliable answers from private data.
The 2026 AI landscape shows RAG evolving with:
- Graph RAG: Connecting documents through relationships
- Active Retrieval: LLMs deciding when and what to retrieve
- Multimodal RAG: Retrieving images, audio, and video
- Agentic RAG: Systems that use retrieved information to take actions
Start with the implementation above, then iterate based on your specific use case and performance requirements.