AICloudInsider
Building Your First RAG System: Practical Guide with LangChain and Pinecone

Step-by-step implementation of Retrieval-Augmented Generation with real code, vector database setup, and production considerations.

Sarah Chen

ML Engineer & Cloud AI Specialist

Retrieval-Augmented Generation (RAG) has become the standard architecture for building LLM applications with private data. In this intermediate tutorial, we'll build a production-ready RAG system from scratch using LangChain and Pinecone, with performance optimizations and real deployment considerations.

Why RAG? The Knowledge Gap Problem

Large Language Models have impressive parametric knowledge (trained on public data up to a cutoff date), but they lack:

  1. Private data (company documents, internal wikis)
  2. Recent information (news, latest research)
  3. Domain-specific expertise (medical journals, legal precedents)

RAG solves this by retrieving relevant documents and providing them as context to the LLM.
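The retrieve-then-prompt idea can be sketched without any external service. The snippet below is a minimal illustration only: the three-dimensional vectors, the `toy_store` contents, and the helper names are made up for the example, and a real system would get its vectors from an embedding model and its neighbours from a vector database.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question, contexts):
    """Place the retrieved passages in front of the question."""
    joined = "\n".join(f"- {c}" for c in contexts)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical hand-written vectors standing in for real embeddings
toy_store = [
    ("Refunds are processed within 14 days.", [0.9, 0.1, 0.0]),
    ("The office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Refund requests need an order number.", [0.8, 0.3, 0.1]),
]

top = retrieve([1.0, 0.2, 0.0], toy_store, k=2)
prompt = build_prompt("How long do refunds take?", top)
print(prompt)
```

The prompt that reaches the LLM contains only the two refund passages; the unrelated holiday passage is filtered out by the similarity ranking.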

Architecture Overview

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   User Query    │────▶│ Document Store  │────▶│   Embeddings    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                       │                       │
         ▼                       ▼                       ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Query Embed   │────▶│ Vector Database │────▶│   Top-K Docs    │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │                                               │
         ▼                                               ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  LLM + Context  │◀────│  Prompt Engine  │◀────│ Retrieved Docs  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
         │
         ▼
┌─────────────────┐
│  Final Answer   │
└─────────────────┘

Prerequisites

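bash
# Install required packages (langchain-openai provides ChatOpenAI and OpenAIEmbeddings;
# the Pinecone client is published as "pinecone", formerly "pinecone-client")
pip install langchain langchain-community langchain-openai langchain-pinecone pinecone openai
pip install sentence-transformers pypdf python-dotenv

# For production monitoring and evaluation
pip install langsmith ragas datasets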

Step 1: Document Processing Pipeline

Document Loading and Chunking Strategies

python
import os

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

CHUNK_SIZE = 1000    # characters; a common starting point, tune for your corpus
CHUNK_OVERLAP = 200  # characters shared by neighbouring chunks to preserve context


# Load every PDF under a directory and split it into overlapping chunks
def load_and_chunk_documents(directory_path: str):
    loader = DirectoryLoader(
        directory_path,
        glob="**/*.pdf",
        loader_cls=PyPDFLoader,
        show_progress=True,
    )

    documents = loader.load()

    # Split on the largest separator that fits first: paragraphs, then lines,
    # then sentence punctuation, then words, then single characters
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
        separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
    )

    chunks = text_splitter.split_documents(documents)

    print(f"Loaded {len(documents)} documents")
    print(f"Created {len(chunks)} chunks with {CHUNK_OVERLAP}-character overlap")

    return chunks


# Add metadata for better retrieval and filtering
def enhance_chunk_metadata(chunks):
    for i, chunk in enumerate(chunks):
        chunk.metadata.update({
            "chunk_id": i,
            "chunk_length": len(chunk.page_content),
            # File extension of the source path, e.g. ".pdf"
            "source_type": os.path.splitext(chunk.metadata.get("source", ""))[1],
        })
    return chunks
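To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators); the `sliding_chunks` helper is a hypothetical name used only to make the overlap visible.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last four characters of one chunk reappear at the start of the next. A larger overlap means more chunks and more storage for the same text.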

Step 2: Vector Database Setup with Pinecone

Production-Ready Index Configuration

python
import os
import time

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

load_dotenv()


# Initialize Pinecone with production settings
def initialize_pinecone_index():
    # Client versions that ship ServerlessSpec use a Pinecone instance;
    # the older module-level pinecone.init() call is no longer available
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

    index_name = "aicloudinsider-rag-index"

    # Check if index exists
    if index_name not in pc.list_indexes().names():
        print(f"Creating index: {index_name}")

        # Production index spec
        pc.create_index(
            name=index_name,
            dimension=3072,  # must match text-embedding-3-large (3072 dimensions)
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region="us-west-2",
            ),
        )

        # Wait for index to be ready
        while not pc.describe_index(index_name).status["ready"]:
            time.sleep(1)

    return index_name


# Create embeddings and upload
def create_and_store_embeddings(chunks, index_name):
    # Use OpenAI embeddings (or open-source alternatives)
    embeddings = OpenAIEmbeddings(
        model="text-embedding-3-large",
        openai_api_key=os.getenv("OPENAI_API_KEY"),
    )

    # Alternative: open-source embeddings. The index dimension must then
    # match that model's output size (1024 for gte-large).
    # from langchain_community.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(
    #     model_name="thenlper/gte-large"
    # )

    # Create vector store (reads PINECONE_API_KEY from the environment)
    vector_store = PineconeVectorStore.from_documents(
        documents=chunks,
        embedding=embeddings,
        index_name=index_name,
    )

    print(f"Stored {len(chunks)} embeddings in Pinecone")
    return vector_store
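A frequent source of upsert errors is an index whose `dimension` does not match the embedding model's output size. The check below is a small sketch that could run before index creation; the `KNOWN_DIMENSIONS` table and the `check_dimension` helper are assumptions made for this example, not part of the Pinecone or LangChain APIs.

```python
# Output sizes of the default configurations of three OpenAI embedding models
KNOWN_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_dimension(model_name, index_dimension, known=KNOWN_DIMENSIONS):
    """Raise early if the index dimension and the model output size disagree."""
    expected = known.get(model_name)
    if expected is None:
        raise KeyError(
            f"Unknown model {model_name!r}: embed a sample text and use len(vector)"
        )
    if expected != index_dimension:
        raise ValueError(
            f"{model_name} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return expected


print(check_dimension("text-embedding-3-large", 3072))
```

For models that are not in the table, embedding one short string and taking the length of the returned vector gives the value to use for the index.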

Step 3: Retrieval Strategies and Optimization

Hybrid Search: Combining Dense and Sparse Retrieval

python
class HybridRetriever:
    def __init__(self, vector_store, sparse_weight=0.3):
        self.vector_store = vector_store
        self.sparse_weight = sparse_weight  # share of the final score given to keywords

    def dense_search(self, query_embedding, k=10):
        """Dense vector similarity search; returns (document, score) pairs"""
        # The input is a precomputed embedding, so use the by-vector variant
        return self.vector_store.similarity_search_by_vector_with_score(
            query_embedding,
            k=k,
        )

    def sparse_search(self, query, k=10):
        """Sparse keyword-based search (placeholder)"""
        # In production, use BM25 or proper sparse embeddings backed by a
        # sparse index. This stub returns no documents.
        return []

    def hybrid_retrieve(self, query, query_embedding, k=5):
        """Rescore dense candidates with a keyword-overlap score"""
        # Retrieve more candidates than needed so rescoring can reorder them
        dense_results = self.dense_search(query_embedding, k=k * 2)

        # Score fusion
        combined_results = []

        for doc, dense_score in dense_results:
            sparse_score = self._calculate_sparse_score(doc.page_content, query)

            # Weighted combination of the two scores
            final_score = (
                (1 - self.sparse_weight) * dense_score
                + self.sparse_weight * sparse_score
            )

            combined_results.append((doc, final_score))

        # Sort by combined score and return top-k
        combined_results.sort(key=lambda x: x[1], reverse=True)
        return [doc for doc, score in combined_results[:k]]

    def _calculate_sparse_score(self, document, query):
        """Fraction of query words that appear in the document"""
        query_words = set(query.lower().split())
        doc_words = set(document.lower().split())

        intersection = query_words.intersection(doc_words)
        return len(intersection) / len(query_words) if query_words else 0
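A weighted sum only behaves well when dense and sparse scores live on comparable scales. One alternative that sidesteps score scales entirely is reciprocal rank fusion (RRF), which is a different technique from the weighted combination above: it merges ranked lists using positions only. The sketch below assumes each retriever returns an ordered list of document ids; the ids and the conventional constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids; each hit adds 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical result lists from a dense and a sparse retriever
dense = ["doc_a", "doc_b", "doc_c"]
sparse = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense, sparse])
print(fused)
```

Documents that appear near the top of both lists rise to the front, while documents found by only one retriever still stay in the candidate set.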

Step 4: LLM Integration with Advanced Prompt Engineering

Context-Aware Prompt Template

python
from langchain.chains import RetrievalQA
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI


def create_rag_chain(vector_store, llm_model="gpt-4-turbo"):
    # System prompt with instructions; {context} is filled with retrieved chunks
    system_prompt = """You are an expert AI assistant answering questions based on provided context.

    Context Rules:
    1. ONLY use information from the provided context
    2. If context doesn't contain answer, say "I cannot answer based on provided context"
    3. Cite specific sources from context when possible
    4. Maintain technical accuracy

    Context:
    {context}"""

    # The question goes in the human message, so it is not repeated above
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{question}"),
    ])

    # LLM with conservative parameters
    llm = ChatOpenAI(
        model=llm_model,
        temperature=0.1,  # Low for factual accuracy
        max_tokens=1000,
        timeout=30,
        max_retries=3,
    )

    # Create chain with retrieval (RetrievalQA is LangChain's legacy chain API)
    qa_chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=vector_store.as_retriever(
            search_type="similarity",
            search_kwargs={"k": 5},
        ),
        # Without this, the chain falls back to its default prompt
        chain_type_kwargs={"prompt": prompt},
        return_source_documents=True,
    )

    return qa_chain
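The system prompt asks the model to cite sources, which only works if the context it receives carries source labels. The sketch below shows one way to format retrieved chunks with numbered labels; the `Doc` dataclass is a stand-in with the same `page_content` and `metadata` attributes as a LangChain document, and `format_context` is a hypothetical helper, not a LangChain function.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_context(docs):
    """Join chunks into one context string with numbered source labels."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        label = f"[{i}] {source}"
        if page is not None:
            label += f", page {page}"
        blocks.append(f"{label}\n{doc.page_content.strip()}")
    return "\n\n".join(blocks)


docs = [
    Doc("Alpha text. ", {"source": "a.pdf", "page": 3}),
    Doc("Beta", {}),
]
context = format_context(docs)
print(context)
```

With labels like `[1] a.pdf, page 3` in the context, the model has something concrete to cite, and the citations can be checked against the returned source documents.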

Step 5: Production Deployment Considerations

Monitoring and Evaluation Setup

python
import os

import langsmith
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, faithfulness


def setup_monitoring():
    # LangSmith tracing is configured through environment variables;
    # the API key is read from LANGCHAIN_API_KEY
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGSMITH_API_KEY", "")
    os.environ["LANGCHAIN_PROJECT"] = "aicloudinsider-rag"

    # Create evaluation dataset
    evaluation_questions = [
        "What is transformer architecture?",
        "How does attention mechanism work?",
        "Explain RAG architecture components",
    ]

    return evaluation_questions


def evaluate_rag_performance(qa_chain, questions):
    """RAG evaluation without reference answers"""
    results = []

    for question in questions:
        # Trace the query
        with langsmith.trace("rag-evaluation", inputs={"query": question}):
            response = qa_chain.invoke({"query": question})

        contexts = [doc.page_content for doc in response["source_documents"]]

        # ragas expects a datasets.Dataset. Column names follow ragas 0.1.x;
        # newer releases rename them.
        dataset = Dataset.from_dict({
            "question": [question],
            "answer": [response["result"]],
            "contexts": [contexts],
        })

        # context_recall and context_precision need a ground-truth answer,
        # so only reference-free metrics are computed here
        metrics = evaluate(
            dataset=dataset,
            metrics=[faithfulness, answer_relevancy],
        )

        results.append({
            "question": question,
            "answer": response["result"],
            # Depending on the ragas version this is a mean score or a per-row list
            "faithfulness": metrics["faithfulness"],
            "relevancy": metrics["answer_relevancy"],
            "sources": len(response["source_documents"]),
        })

    return results
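Metric scores become more useful when they gate a release instead of only being logged. The sketch below compares a dictionary of scores against minimum thresholds; `failed_thresholds` is a hypothetical helper, the 0.85 faithfulness floor matches the target this guide uses in its deployment checklist, and the 0.80 relevancy floor is an arbitrary placeholder.

```python
def failed_thresholds(scores, thresholds):
    """Return the metrics that are missing or below their minimum."""
    failures = {}
    for name, minimum in thresholds.items():
        value = scores.get(name)
        if value is None or value < minimum:
            failures[name] = value
    return failures


# Example scores as they might come out of an evaluation run (made-up numbers)
scores = {"faithfulness": 0.90, "answer_relevancy": 0.70}
thresholds = {"faithfulness": 0.85, "answer_relevancy": 0.80}

failures = failed_thresholds(scores, thresholds)
if failures:
    print(f"Quality gate failed: {failures}")
```

In a CI pipeline, a non-empty result could fail the build so that a retrieval or prompt change that lowers answer quality does not ship unnoticed.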

Step 6: Performance Optimization Techniques

Caching, Batching, and Query Optimization

python
import asyncio


class OptimizedRAGSystem:
    def __init__(self, vector_store, llm):
        self.vector_store = vector_store
        self.llm = llm
        self.embedding_cache = {}
        self.query_cache = {}

    def get_embeddings(self, text):
        """Cache embeddings for common queries"""
        # A plain dict is enough here; lru_cache on a method would also key
        # on self and keep the instance alive
        if text not in self.embedding_cache:
            # Calculate the embedding with the store's embedding model
            self.embedding_cache[text] = self.vector_store.embeddings.embed_query(text)
        return self.embedding_cache[text]

    async def batch_retrieve(self, queries, k=5):
        """Run retrieval for multiple queries concurrently"""
        # Get embeddings for all queries
        embeddings = [self.get_embeddings(q) for q in queries]

        # The by-vector search is synchronous, so each call runs in a worker
        # thread and the searches overlap instead of running one by one
        results = await asyncio.gather(*[
            asyncio.to_thread(
                self.vector_store.similarity_search_by_vector_with_score,
                emb,
                k=k,
            )
            for emb in embeddings
        ])

        return results

    def query_decomposition(self, complex_query):
        """Break complex queries into simpler sub-queries"""
        decomposition_prompt = f"""
        Break this complex query into simpler sub-queries:

        Complex query: {complex_query}

        Return as JSON list of sub-queries.
        """

        # Placeholder: a real implementation would send decomposition_prompt
        # to self.llm and parse the JSON reply. Hard-coded for the example.
        return [
            "What is transformer architecture?",
            "How does attention work?",
            "What are RAG components?",
        ]
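An embedding cache pays off more when trivially different queries ("Hello  World" and "hello world") map to the same entry and when its size is bounded. The sketch below is one possible in-process design under those assumptions; `EmbeddingCache` and the fake embedding function are illustrative, and a shared cache such as Redis would replace the dict in a multi-worker deployment.

```python
import hashlib


class EmbeddingCache:
    """Bounded cache keyed by a hash of the normalized query text."""

    def __init__(self, embed_fn, max_size=1000):
        self.embed_fn = embed_fn
        self.max_size = max_size
        self._store = {}  # dicts keep insertion order, oldest entry first
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text):
        # Lowercase and collapse whitespace so near-identical queries collide
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text):
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            # Re-insert to mark the entry as most recently used
            self._store[key] = self._store.pop(key)
            return self._store[key]

        self.misses += 1
        vector = tuple(self.embed_fn(text))  # tuple: cached values stay immutable
        if len(self._store) >= self.max_size:
            del self._store[next(iter(self._store))]  # evict least recently used
        self._store[key] = vector
        return vector


calls = []


def fake_embed(text):
    """Stand-in for a real embedding call; records every invocation."""
    calls.append(text)
    return [float(len(text))]


cache = EmbeddingCache(fake_embed, max_size=2)
first = cache.get("Hello  World")
second = cache.get("hello world")  # served from the cache
print(cache.hits, cache.misses, len(calls))
```

Normalizing before hashing trades a little precision for a higher hit rate; if casing matters for your embedding model, drop the `lower()` call.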

Production Deployment Checklist

Infrastructure:

  • Pinecone index with proper sizing
  • OpenAI API keys with rate limits
  • LangSmith for monitoring
  • Backup vector database (pgvector for redundancy)

Performance:

  • Embedding caching layer (Redis)
  • Query batching for high throughput
  • Async processing for concurrent requests
  • Load testing with realistic queries

Quality:

  • RAG metrics tracking (faithfulness > 0.85)
  • Human evaluation pipeline
  • A/B testing for retrieval strategies
  • Regular data refresh (weekly/monthly)

Security:

  • API key rotation (quarterly)
  • Input validation and sanitization
  • Rate limiting per user
  • Audit logging for sensitive queries
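Two of the security items, input validation and per-user rate limiting, fit in a few lines of plain Python. The sketch below uses a token bucket per user and a simple character filter; the class names, the limits, and the 2000-character cap are assumptions for illustration, and a multi-instance deployment would keep the buckets in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to capacity, refilled at a steady rate."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        now = self.clock()
        elapsed = now - self.updated
        # Add the tokens earned since the last call, capped at capacity
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerUserLimiter:
    """Keeps one bucket per user id."""

    def __init__(self, capacity=5, refill_per_second=1.0, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        self._buckets = {}

    def allow(self, user_id):
        if user_id not in self._buckets:
            self._buckets[user_id] = TokenBucket(
                self.capacity, self.refill_per_second, self.clock
            )
        return self._buckets[user_id].allow()


def validate_query(text, max_chars=2000):
    """Drop non-printable characters and reject empty or oversized queries."""
    cleaned = "".join(ch for ch in text if ch.isprintable() or ch in "\n\t").strip()
    if not cleaned:
        raise ValueError("query is empty")
    if len(cleaned) > max_chars:
        raise ValueError("query is too long")
    return cleaned


limiter = PerUserLimiter(capacity=2, refill_per_second=1.0)
if limiter.allow("user-42"):
    print(validate_query("  What is RAG?\x00 "))
```

The `clock` parameter exists so the limiter can be tested with a fake clock instead of real sleeps.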

Common Pitfalls and Solutions

  1. Chunking Issues: Chunks that are too small lose context; chunks that are too large reduce precision

    • Solution: Dynamic chunking based on document structure
  2. Retrieval Quality: Returning irrelevant documents

    • Solution: Hybrid search + reranking with cross-encoders
  3. Context Window Limits: Too many retrieved documents exceed the LLM's context window

    • Solution: Smart truncation + summarization chains
  4. LLM Hallucination: Generating facts not in context

    • Solution: Strong system prompt + answer verification step
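For the context window pitfall, a simple form of truncation is to pack the most relevant chunks into a fixed token budget. The sketch below assumes chunks arrive sorted by relevance and approximates token counts by whitespace-separated words; `pack_context` is a hypothetical helper, and a real tokenizer for your model can be passed in as `count_tokens` for exact counts.

```python
def pack_context(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Greedily keep chunks, most relevant first, while they fit the budget."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too big for the remaining budget; a later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used


ranked_chunks = [
    "one two three",
    "four five six seven eight",
    "nine ten",
]
selected, used = pack_context(ranked_chunks, max_tokens=5)
print(selected, used)
```

Skipping an oversized chunk instead of stopping lets smaller, lower-ranked chunks use the remaining budget; stopping at the first miss would be the stricter alternative if order matters more than coverage.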

Advanced RAG Patterns

1. Multi-hop Reasoning

Break complex questions into sequential retrievals:

python
# Example: "What was the impact of transformer architecture on BERT's performance?"
# Step 1: Retrieve about transformers
# Step 2: Retrieve about BERT
# Step 3: Synthesize answer
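The sequential retrieval steps can be sketched with a toy keyword retriever. Everything here is illustrative: `CORPUS`, `keyword_retrieve`, and `multi_hop_retrieve` are made-up names, the sub-queries are written by hand instead of produced by an LLM, and the final synthesis step (passing the collected evidence to the LLM) is left out.

```python
CORPUS = [
    "The transformer architecture relies on self-attention.",
    "BERT is a bidirectional encoder built from transformer layers.",
    "Pinecone is a managed vector database.",
]


def keyword_retrieve(query, k):
    """Toy retriever: rank corpus passages by shared lowercase words."""
    terms = set(query.lower().split())
    scored = []
    for passage in CORPUS:
        words = set(passage.lower().replace(".", "").split())
        overlap = len(terms & words)
        if overlap > 0:
            scored.append((overlap, passage))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def multi_hop_retrieve(sub_queries, retrieve, k=2):
    """Run one retrieval per sub-query and collect de-duplicated evidence."""
    seen = set()
    evidence = []
    for hop, sub_query in enumerate(sub_queries, start=1):
        for passage in retrieve(sub_query, k):
            if passage not in seen:
                seen.add(passage)
                evidence.append((hop, passage))
    return evidence


hops = multi_hop_retrieve(
    ["transformer architecture", "bert encoder"], keyword_retrieve, k=1
)
for hop, passage in hops:
    print(hop, passage)
```

Keeping the hop number next to each passage makes it easier to show the model, and later a reviewer, which sub-question each piece of evidence was meant to answer.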

2. Query Expansion

Generate multiple query variations:

python
# Original: "AI model training"
# Expanded: ["machine learning training", "neural network training", "LLM fine-tuning"]
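Once the variations exist, their result lists have to be merged. A minimal sketch, assuming each retrieval call returns `(doc_id, score)` pairs on a shared scale: keep each document's best score across all variations, then rank. The `FAKE_RESULTS` table stands in for real vector store calls and `expanded_retrieve` is a hypothetical helper.

```python
def expanded_retrieve(queries, retrieve, k=3):
    """Retrieve for every query variation and keep each document's best score."""
    best = {}
    for query in queries:
        for doc_id, score in retrieve(query):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Made-up results for two query variations
FAKE_RESULTS = {
    "AI model training": [("d1", 0.80), ("d2", 0.60)],
    "LLM fine-tuning": [("d2", 0.90), ("d3", 0.50)],
}

merged = expanded_retrieve(list(FAKE_RESULTS), FAKE_RESULTS.get, k=2)
print(merged)
```

Taking the maximum rewards a document that matches any one phrasing strongly; summing scores instead would favour documents that match many phrasings moderately.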

3. Hybrid Search with Reranking

Dense retrieval → sparse retrieval → cross-encoder reranking
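The reranking stage at the end of that pipeline takes the merged candidates and rescores each (query, passage) pair with a more expensive model. The sketch below keeps the scorer pluggable: `rerank` and `overlap_score` are hypothetical names, and the word-overlap scorer is only a placeholder for a cross-encoder, which in practice would be wrapped in a function with the same `(query, text)` signature.

```python
def rerank(query, candidates, score_pair, top_n=3):
    """Rescore every candidate against the query and keep the best top_n."""
    scored = [(score_pair(query, text), text) for text in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def overlap_score(query, text):
    """Placeholder scorer: fraction of query words present in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "vector database pricing",
    "how rag retrieval works",
    "rag retrieval with reranking",
]
best = rerank("rag retrieval reranking", candidates, overlap_score, top_n=2)
print(best)
```

Because the reranker scores every pair, it is usually applied to a few dozen candidates at most, after the cheaper dense and sparse stages have narrowed the field.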

Cost Optimization

Component      | Cost Driver   | Optimization Strategy
Embeddings     | API calls     | Batch requests, cache common queries
Vector DB      | Storage/query | Tiered storage, query optimization
LLM            | Token usage   | Context compression, response caching
Infrastructure | Compute       | Auto-scaling, spot instances
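The effect of context compression and response caching on the LLM line item can be estimated with simple arithmetic. The function below is a sketch; every number in the example call, including the per-1K-token prices, is a placeholder to be replaced with your provider's current pricing and your own traffic figures.

```python
def monthly_llm_cost(
    queries_per_day,
    prompt_tokens,
    completion_tokens,
    price_in_per_1k,
    price_out_per_1k,
    cache_hit_rate=0.0,
    days=30,
):
    """Estimate monthly LLM spend; cached responses are treated as free."""
    billable_queries = queries_per_day * days * (1 - cache_hit_rate)
    cost_per_query = (
        prompt_tokens / 1000 * price_in_per_1k
        + completion_tokens / 1000 * price_out_per_1k
    )
    return billable_queries * cost_per_query


# Placeholder traffic and prices, not real quotes
baseline = monthly_llm_cost(1000, 3000, 500, 0.01, 0.03)
with_cache = monthly_llm_cost(1000, 3000, 500, 0.01, 0.03, cache_hit_rate=0.5)
compressed = monthly_llm_cost(1000, 1500, 500, 0.01, 0.03)
print(round(baseline, 2), round(with_cache, 2), round(compressed, 2))
```

Under these assumed numbers, a 50% cache hit rate halves the bill, while halving the prompt from 3000 to 1500 tokens removes only the input share of each query's cost.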

Next Steps: From Prototype to Production

  1. Implement the basic RAG pipeline above
  2. Add monitoring with LangSmith and RAGAS
  3. Optimize retrieval with hybrid search
  4. Scale with caching and batching
  5. Deploy with proper CI/CD and rollback strategies

Conclusion

Building a production RAG system involves more than just connecting an LLM to a vector database. By implementing proper chunking strategies, hybrid retrieval, comprehensive monitoring, and performance optimizations, you can create systems that deliver accurate, reliable answers from private data.

The 2026 AI landscape shows RAG evolving with:

  • Graph RAG: Connecting documents through relationships
  • Active Retrieval: LLMs deciding when and what to retrieve
  • Multimodal RAG: Retrieving images, audio, and video
  • Agentic RAG: Systems that use retrieved information to take actions

Start with the implementation above, then iterate based on your specific use case and performance requirements.

Sarah Chen

ML Engineer & Cloud AI Specialist

Former Google Brain engineer with 8+ years in production ML systems. Specializes in distributed training, model optimization, and cloud-native AI architectures. AWS ML Hero and PyTorch contributor.
