How to Build a RAG Pipeline with LangChain and Pinecone in 2026
TL;DR
- Build a complete RAG system that ingests documents, creates vector embeddings, and answers questions with citations
- Stack: LangChain 0.1.x, Pinecone serverless, OpenAI embeddings (text-embedding-3-small), GPT-4
- Real code: Every configuration, API call, and error handler you need for production
- End result: A Python application that processes PDFs, chunks intelligently, stores 1536-dimensional vectors in Pinecone, and returns contextual answers with source references in under 2 seconds
Prerequisites
Before we start, make sure you have:
Required:
- Python 3.11+ installed (run python --version to check)
- OpenAI API key with credits (get from platform.openai.com)
- Pinecone account (free tier works fine — sign up at pinecone.io)
- pip package manager (comes with Python)
- Git for cloning any sample documents
Knowledge:
- Basic Python syntax (functions, classes, imports)
- Familiarity with API keys and environment variables
- Understanding of what embeddings are (vectors representing text meaning)
Time:
- ~45 minutes to complete the full tutorial
- ~10 minutes more if you hit dependency conflicts
Cost:
- Pinecone: Free tier (1 serverless index, sufficient for testing)
- OpenAI: ~$0.50 for embeddings + LLM calls during this tutorial
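To sanity-check the OpenAI estimate for your own documents, the arithmetic can be sketched as below. The function name, token counts, and prices are all illustrative assumptions rather than quoted rates; check the current OpenAI pricing page before relying on the result.

```python
def estimate_cost_usd(embedding_tokens, prompt_tokens, completion_tokens,
                      embedding_price, prompt_price, completion_price):
    """Estimate total spend in USD. All prices are USD per one million tokens."""
    total = (embedding_tokens * embedding_price
             + prompt_tokens * prompt_price
             + completion_tokens * completion_price)
    return total / 1_000_000


# Illustrative inputs only (assumptions, not official pricing):
# 50k tokens embedded, 40k prompt tokens and 10k completion tokens for the LLM.
cost = estimate_cost_usd(
    embedding_tokens=50_000,
    prompt_tokens=40_000,
    completion_tokens=10_000,
    embedding_price=0.02,
    prompt_price=10.0,
    completion_price=30.0,
)
print(f"Estimated cost: ${cost:.2f}")
```

Most of the spend in a run like this comes from the LLM calls, since each query sends the retrieved chunks as prompt tokens.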
What We’re Building
We’re building a RAG (Retrieval-Augmented Generation) pipeline that lets an LLM answer questions from your documents with factual grounding. To reduce hallucinated answers, the system first retrieves relevant chunks from a vector database, then prompts the model to answer based only on that context.
Architecture flow:
PDF Documents → Text Extraction → Chunking → OpenAI Embeddings → Pinecone Vector Store → Query (with embedding) → Top-K Retrieval → Context + Prompt → GPT-4 → Answer with Citations
Why this stack? LangChain abstracts RAG complexity (chunking strategies, retrieval chains, prompt templates), Pinecone offers serverless vector search with no infrastructure management, and OpenAI’s latest embeddings (text-embedding-3-small) deliver strong retrieval quality at 1/5th the cost of older models.
Step 1: Set Up Your Python Environment
Create a clean project directory and virtual environment to avoid dependency conflicts.
mkdir rag-langchain-pinecone
cd rag-langchain-pinecone
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
Install the required packages with pinned versions to ensure reproducibility:
pip install langchain==0.1.20 \
langchain-openai==0.1.7 \
langchain-pinecone==0.1.0 \
pinecone-client==3.2.2 \
pypdf==4.2.0 \
python-dotenv==1.0.1 \
tiktoken==0.7.0
Expected output:
Successfully installed langchain-0.1.20 langchain-openai-0.1.7 ...
What each package does:
- langchain: Core RAG framework with document loaders and chains
- langchain-openai: OpenAI integration for embeddings and chat models
- langchain-pinecone: Pinecone vector store integration
- pinecone-client: Official Pinecone SDK
- pypdf: PDF text extraction
- python-dotenv: Environment variable management
- tiktoken: OpenAI’s tokenizer for accurate chunk sizing
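If you want to confirm that the pinned versions actually landed in the active environment, a small check along these lines may help. It is a sketch: the find_mismatches helper and the PINNED dictionary are names introduced here for illustration, and only the standard-library importlib.metadata module is used.

```python
from importlib import metadata

# Versions pinned in the pip install command above.
PINNED = {
    "langchain": "0.1.20",
    "langchain-openai": "0.1.7",
    "langchain-pinecone": "0.1.0",
    "pinecone-client": "3.2.2",
    "pypdf": "4.2.0",
    "python-dotenv": "1.0.1",
    "tiktoken": "0.7.0",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: installed_version_or_None} for packages that differ from the pin."""
    problems = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            problems[name] = installed
    return problems


if __name__ == "__main__":
    for name, installed in find_mismatches(PINNED).items():
        print(f"{name}: expected {PINNED[name]}, found {installed}")
```

An empty result means every package matches its pin; a value of None means the package is missing from the environment.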
Step 2: Configure API Keys
Create a .env file in your project root to store credentials securely:
touch .env
Add your API keys (replace with your actual keys):
OPENAI_API_KEY=sk-proj-abcdefghijklmnopqrstuvwxyz1234567890
PINECONE_API_KEY=pcsk_abcdef_1234567890abcdefghijklmnopqrstuvwxyz
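PINECONE_ENVIRONMENT=us-east-1
Where to find these:
- OpenAI key: platform.openai.com/api-keys
- Pinecone key: app.pinecone.io → API Keys tab
- Pinecone region: The cloud region for your serverless index, such as us-east-1. Step 3 passes this value to ServerlessSpec as region, so use the bare region name rather than a legacy pod environment name like us-east-1-aws or gcp-starter.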
Create a .gitignore to prevent committing secrets:
echo ".env" >> .gitignore
echo "venv/" >> .gitignore
echo "__pycache__/" >> .gitignore
echo "*.pyc" >> .gitignore
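Because a missing or empty key tends to surface later as a confusing authentication error, it can be worth failing early. The sketch below assumes the three variable names from the .env file above; missing_env_vars is a hypothetical helper, not part of any library.

```python
import os

# Variable names taken from the .env file in this step.
REQUIRED_VARS = ("OPENAI_API_KEY", "PINECONE_API_KEY", "PINECONE_ENVIRONMENT")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All required environment variables are set")
```

Call load_dotenv() before this check in a real script so values from .env are visible in os.environ.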
Step 3: Initialize Pinecone Vector Store
Create setup_pinecone.py to configure your vector database:
import os
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec

load_dotenv()

def initialize_pinecone():
    """Create Pinecone index for storing document embeddings."""
    # Initialize Pinecone client
    pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))
    index_name = "langchain-rag-demo"
    # Check if index already exists
    existing_indexes = [index.name for index in pc.list_indexes()]
    if index_name not in existing_indexes:
        print(f"Creating index '{index_name}'...")
        # Create serverless index with 1536 dimensions (OpenAI text-embedding-3-small)
        pc.create_index(
            name=index_name,
            dimension=1536,
            metric="cosine",
            spec=ServerlessSpec(
                cloud="aws",
                region=os.getenv("PINECONE_ENVIRONMENT")
            )
        )
        print(f"✓ Index '{index_name}' created successfully")
    else:
        print(f"✓ Index '{index_name}' already exists")
    return index_name

if __name__ == "__main__":
    index_name = initialize_pinecone()
    print(f"\nPinecone index ready: {index_name}")
Run the setup script:
python setup_pinecone.py
Expected output:
Creating index 'langchain-rag-demo'...
✓ Index 'langchain-rag-demo' created successfully
Pinecone index ready: langchain-rag-demo
Key parameters explained:
- dimension=1536: Matches OpenAI’s text-embedding-3-small output size
- metric="cosine": Measures similarity by the angle between vectors (scores range from -1 to 1, higher = more similar)
- ServerlessSpec: No infrastructure management, pay for what you use
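For intuition about the cosine metric, here is a minimal pure-Python sketch of the calculation. It is only an illustration of the formula; Pinecone computes this server-side, and the cosine_similarity helper is a name introduced here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the score depends only on direction, scaling a vector does not change its similarity to another vector, which is one reason cosine is a common choice for text embeddings.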
Step 4: Load and Chunk Documents
Create a data/ folder and add sample PDFs:
mkdir data
# Add your PDFs to data/ folder
# For testing, you can use any PDF — research papers, reports, manuals, etc.
Create document_processor.py to handle PDF ingestion:
import os
from typing import List
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document

class DocumentProcessor:
    """Load and chunk documents for embedding."""

    def __init__(self, data_dir: str = "./data"):
        self.data_dir = data_dir
        # Configure chunking strategy
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len,
            separators=["\n\n", "\n", ". ", " ", ""]
        )

    def load_documents(self) -> List[Document]:
        """Load all PDFs from data directory."""
        print(f"Loading documents from {self.data_dir}...")
        loader = PyPDFDirectoryLoader(self.data_dir)
        documents = loader.load()
        print(f"✓ Loaded {len(documents)} document pages")
        return documents

    def chunk_documents(self, documents: List[Document]) -> List[Document]:
        """Split documents into smaller chunks for better retrieval."""
        print("Chunking documents...")
        chunks = self.text_splitter.split_documents(documents)
        # Add metadata for tracking sources
        for i, chunk in enumerate(chunks):
            chunk.metadata["chunk_id"] = i
            chunk.metadata["chunk_size"] = len(chunk.page_content)
        # Guard against an empty data folder to avoid dividing by zero
        avg_chars = sum(len(c.page_content) for c in chunks) // max(len(chunks), 1)
        print(f"✓ Created {len(chunks)} chunks (avg {avg_chars} chars)")
        return chunks

    def process(self) -> List[Document]:
        """Full pipeline: load → chunk."""
        docs = self.load_documents()
        chunks = self.chunk_documents(docs)
        return chunks

if __name__ == "__main__":
    processor = DocumentProcessor()
    chunks = processor.process()
    print(f"\nReady to embed {len(chunks)} chunks")
    if chunks:
        print(f"Sample chunk: {chunks[0].page_content[:200]}...")
Test the document processor:
python document_processor.py
Expected output:
Loading documents from ./data...
✓ Loaded 15 document pages
Chunking documents...
✓ Created 42 chunks (avg 847 chars)
Ready to embed 42 chunks
Sample chunk: Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...
Chunking strategy explained:
- chunk_size=1000: Target up to 1000 characters per chunk (balances context vs. precision)
- chunk_overlap=200: A 200-character overlap reduces the chance that related sentences end up split across chunks
- separators: Split on paragraph breaks first, then line breaks, sentences, words, and finally individual characters
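To see how chunk_size and chunk_overlap interact, here is a deliberately simplified sliding-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that class also respects the separators list); chunk_with_overlap is a hypothetical helper that only illustrates the overlap arithmetic.

```python
def chunk_with_overlap(text, chunk_size=1000, chunk_overlap=200):
    """Split text into fixed-size windows where each window repeats the tail of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small numbers make the overlap easy to see.
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With the tutorial's settings, each window advances 800 characters, so a larger overlap means more chunks (and more embedding calls) for the same document.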
Step 5: Create Embeddings and Store in Pinecone
Create vector_store.py to handle embedding and indexing:
import os
from typing import List
from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain.schema import Document
from document_processor import DocumentProcessor

load_dotenv()

class VectorStoreManager:
    """Manage vector embeddings and Pinecone storage."""

    def __init__(self, index_name: str = "langchain-rag-demo"):
        self.index_name = index_name
        # Initialize OpenAI embeddings (text-embedding-3-small)
        self.embeddings = OpenAIEmbeddings(
            model="text-embedding-3-small",
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )

    def create_vector_store(self, documents: List[Document]) -> PineconeVectorStore:
        """Embed documents and store in Pinecone."""
        print(f"Creating embeddings for {len(documents)} documents...")
        print("(This may take 30-60 seconds depending on document count)")
        vector_store = PineconeVectorStore.from_documents(
            documents=documents,
            embedding=self.embeddings,
            index_name=self.index_name
        )
        print(f"✓ Stored {len(documents)} vectors in Pinecone index '{self.index_name}'")
        return vector_store

    def load_vector_store(self) -> PineconeVectorStore:
        """Load existing vector store (for querying)."""
        vector_store = PineconeVectorStore(
            index_name=self.index_name,
            embedding=self.embeddings
        )
        return vector_store

def index_documents():
    """Main function to process and index documents."""
    # Load and chunk documents
    processor = DocumentProcessor()
    chunks = processor.process()
    # Create embeddings and store
    manager = VectorStoreManager()
    vector_store = manager.create_vector_store(chunks)
    print("\n✓ Indexing complete! Your documents are now searchable.")
    return vector_store

if __name__ == "__main__":
    index_documents()
Run the indexing pipeline:
python vector_store.py
Expected output:
Loading documents from ./data...
✓ Loaded 15 document pages
Chunking documents...
✓ Created 42 chunks (avg 847 chars)
Creating embeddings for 42 documents...
(This may take 30-60 seconds depending on document count)
✓ Stored 42 vectors in Pinecone index 'langchain-rag-demo'
✓ Indexing complete! Your documents are now searchable.
What’s happening behind the scenes:
- Each chunk is sent to OpenAI’s embedding API
- Returns a 1536-dimensional vector representing semantic meaning
- Vector + metadata stored in Pinecone with automatic indexing
- Pinecone indexes the vectors for fast approximate nearest-neighbor similarity search
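As a rough picture of what ends up in the index, each chunk becomes one record with an id, the embedding values, and metadata. The sketch below is an illustration under assumptions: fake_embed is a hypothetical stand-in for the OpenAI embedding call, the chunk-N id scheme is invented for readability, and storing the chunk text under a text metadata key reflects what is believed to be the LangChain integration's default rather than something shown in this tutorial.

```python
from typing import Callable, Dict, List


def fake_embed(text: str) -> List[float]:
    """Hypothetical stand-in for an embedding model: returns a tiny 3-number vector."""
    return [float(len(text)), float(text.count(" ")), float(sum(map(ord, text)) % 97)]


def build_records(chunks: List[Dict], embed: Callable[[str], List[float]]) -> List[Dict]:
    """Turn chunks into id / values / metadata records, one per chunk."""
    records = []
    for i, chunk in enumerate(chunks):
        metadata = dict(chunk["metadata"])        # copy so the input is not mutated
        metadata["text"] = chunk["page_content"]  # keep the raw text for retrieval (assumed key name)
        records.append({
            "id": f"chunk-{i}",                   # illustrative id scheme
            "values": embed(chunk["page_content"]),
            "metadata": metadata,
        })
    return records


sample = [{"page_content": "hello world", "metadata": {"source": "a.pdf", "page": 0}}]
print(build_records(sample, fake_embed)[0]["metadata"])
```

In the real pipeline the values list has 1536 numbers, and the source and page metadata are what make citations possible at query time.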
Step 6: Build the RAG Query Pipeline
Create rag_pipeline.py for question-answering:
import os
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from vector_store import VectorStoreManager

load_dotenv()

class RAGPipeline:
    """Complete RAG pipeline for document Q&A."""

    def __init__(self, index_name: str = "langchain-rag-demo"):
        # Load vector store
        manager = VectorStoreManager(index_name)
        self.vector_store = manager.load_vector_store()
        # Initialize LLM (GPT-4)
        self.llm = ChatOpenAI(
            model="gpt-4-turbo-preview",
            temperature=0.1,  # Low temperature for factual responses
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )
        # Custom prompt template
        self.prompt_template = PromptTemplate(
            template="""You are a helpful AI assistant answering questions based on provided context.
Context: {context}
Question: {question}
Instructions:
- Answer based ONLY on the context provided
- If the context doesn't contain enough information, say "I don't have enough information to answer that"
- Cite specific parts of the context in your answer
- Be concise but complete
Answer:""",
            input_variables=["context", "question"]
        )
        # Build retrieval chain
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=self.llm,
            chain_type="stuff",  # "stuff" passes all retrieved docs to LLM at once
            retriever=self.vector_store.as_retriever(
                search_type="similarity",
                search_kwargs={"k": 4}  # Retrieve top 4 most relevant chunks
            ),
            return_source_documents=True,
            chain_type_kwargs={"prompt": self.prompt_template}
        )

    def query(self, question: str) -> dict:
        """Ask a question and get answer with sources."""
        print(f"\nQuery: {question}")
        print("Retrieving relevant context...\n")
        result = self.qa_chain.invoke({"query": question})
        return {
            "question": question,
            "answer": result["result"],
            "sources": result["source_documents"]
        }

    def format_response(self, result: dict) -> str:
        """Format response with sources."""
        output = []
        output.append(f"Question: {result['question']}")
        output.append(f"\nAnswer: {result['answer']}")
        output.append("\n--- Sources ---")
        for i, doc in enumerate(result['sources'], 1):
            source = doc.metadata.get('source', 'Unknown')
            page = doc.metadata.get('page', 'N/A')
            output.append(f"\n[{i}] {source} (Page {page})")
            output.append(f"    Excerpt: {doc.page_content[:150]}...")
        return "\n".join(output)

if __name__ == "__main__":
    # Initialize pipeline
    rag = RAGPipeline()
    # Example queries
    questions = [
        "What is machine learning?",
        "How does gradient descent work?",
        "What are the main types of neural networks?"
    ]
    for question in questions:
        result = rag.query(question)
        print(rag.format_response(result))
        print("\n" + "="*80 + "\n")
Run the RAG pipeline:
python rag_pipeline.py
Expected output:
Query: What is machine learning?
Retrieving relevant context...
Question: What is machine learning?
Answer: Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience without being explicitly programmed. According to the provided context, it involves algorithms that can identify patterns in data and make predictions or decisions based on those patterns.
--- Sources ---
[1] data/intro_to_ml.pdf (Page 1)
Excerpt: Introduction to Machine Learning
Machine learning is a subset of artificial intelligence that enables systems to learn and improve from experience...
[2] data/intro_to_ml.pdf (Page 2)
Excerpt: The core idea behind machine learning is to allow computers to learn from data rather than following only explicitly programmed instructions...
================================================================================
Key pipeline components:
- search_kwargs={"k": 4}: Returns top 4 most similar chunks (tune based on context window needs)
- temperature=0.1: Low randomness = more deterministic, factual answers
- return_source_documents=True: Enables citation tracking
- chain_type="stuff": Passes all retrieved chunks to the LLM in a single prompt. Alternative options: “map_reduce” (for very large contexts), “refine” (iterative refinement)
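Conceptually, the "stuff" chain type does little more than join the retrieved chunks into one context string and fill in the prompt. The following is a simplified sketch of that idea, not LangChain's actual implementation; build_stuff_prompt and the shortened TEMPLATE are names introduced here, and the blank-line separator is an assumption.

```python
# Shortened, illustrative version of the tutorial's prompt template.
TEMPLATE = "Context: {context}\nQuestion: {question}\nAnswer:"


def build_stuff_prompt(question, chunks, template=TEMPLATE, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the prompt template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)


retrieved = [
    "Machine learning lets systems learn from data.",
    "Models improve with experience.",
]
print(build_stuff_prompt("What is machine learning?", retrieved))
```

This also shows why k matters: every retrieved chunk is added to the prompt, so a larger k means more prompt tokens per query.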
Step 7: Create an Interactive CLI
Create app.py for a simple command-line interface:
import sys
from rag_pipeline import RAGPipeline

def main():
    """Interactive RAG application."""
    print("="*80)
    print("RAG Document Q&A System")
    print("="*80)
    print("\nInitializing pipeline...")
    try:
        rag = RAGPipeline()
        print("✓ Pipeline ready!\n")
    except Exception as e:
        print(f"Error initializing pipeline: {e}")
        sys.exit(1)
    print("Ask questions about your documents (type 'quit' to exit)\n")
    while True:
        try:
            question = input("You: ").strip()
            if not question:
                continue
            if question.lower() in ['quit', 'exit', 'q']:
                print("\nGoodbye!")
                break
            result = rag.query(question)
            formatted = rag.format_response(result)
            print(f"\n{formatted}\n")
            print("-"*80 + "\n")
        except KeyboardInterrupt:
            print("\n\nGoodbye!")
            break
        except Exception as e:
            print(f"\nError processing query: {e}\n")

if __name__ == "__main__":
    main()
Run the interactive application:
python app.py
Expected interaction:
================================================================================
RAG Document Q&A System
================================================================================
Initializing pipeline...
✓ Pipeline ready!
Ask questions about your documents (type 'quit' to exit)
You: What is supervised learning?
Query: What is supervised learning?
Retrieving relevant context...
Question: What is supervised learning?
Answer: Supervised learning is a type of machine learning where the model is trained on labeled data...
[Sources and citations follow]
Testing Your Implementation
Create test_rag.py to verify the complete pipeline:
import time
from rag_pipeline import RAGPipeline

def test_retrieval_quality():
    """Test vector retrieval performance."""
    print("Testing retrieval quality...\n")
    rag = RAGPipeline()
    # Test queries with expected context
    test_cases = [
        {
            "query": "What is the definition of machine learning?",
            "expected_keywords": ["learn", "data", "algorithm", "pattern"]
        },
        {
            "query": "How do neural networks work?",
            "expected_keywords": ["neuron", "layer", "weight", "activation"]
        }
    ]
    for i, test in enumerate(test_cases, 1):
        print(f"Test {i}: {test['query']}")
        start = time.time()
        result = rag.query(test['query'])
        elapsed = time.time() - start
        answer = result['answer'].lower()
        found_keywords = [kw for kw in test['expected_keywords'] if kw in answer]
        print(f"  ✓ Response time: {elapsed:.2f}s")
        print(f"  ✓ Found keywords: {found_keywords}")
        print(f"  ✓ Source documents: {len(result['sources'])}")
        print()

def test_source_attribution():
    """Verify source documents are returned."""
    print("Testing source attribution...\n")
    rag = RAGPipeline()
    result = rag.query("What is machine learning?")
    assert len(result['sources']) > 0, "No source documents returned"
    assert all('source' in doc.metadata for doc in result['sources']), "Missing source metadata"
    print(f"✓ All {len(result['sources'])} sources have proper metadata\n")

if __name__ == "__main__":
    print("="*80)
    print("RAG Pipeline Test Suite")
    print("="*80 + "\n")
    test_retrieval_quality()
    test_source_attribution()
    print("\n" + "="*80)
    print("All tests passed!")
    print("="*80)
Run the test suite:
python test_rag.py
Expected output:
================================================================================
RAG Pipeline Test Suite
================================================================================
Testing retrieval quality...
Test 1: What is the definition of machine learning?
✓ Response time: 1.84s
✓ Found keywords: ['learn', 'data', 'algorithm', 'pattern']
✓ Source documents: 4
Test 2: How do neural networks work?
✓ Response time: 2.01s
✓ Found keywords: ['neuron', 'layer', 'weight', 'activation']
✓ Source documents: 4
Testing source attribution...
✓ All 4 sources have proper metadata
================================================================================
All tests passed!
================================================================================
Common Issues & Fixes
Problem: ModuleNotFoundError: No module named 'langchain_openai'
Cause: Package not installed or wrong virtual environment active
Fix: Ensure venv is activated and reinstall:
source venv/bin/activate
pip install langchain-openai==0.1.7
Problem: PineconeException: Index 'langchain-rag-demo' not found
Cause: Index wasn’t created or wrong environment variable
Fix: Run setup script and verify environment:
python setup_pinecone.py
# Check PINECONE_ENVIRONMENT in .env matches your dashboard
Problem: Retrieval returns irrelevant documents
Cause: Chunks are too large or too small, or k value is wrong
Fix: Tune chunking parameters in document_processor.py:
self.text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Try smaller chunks
    chunk_overlap=100,
    length_function=len
)
And adjust k in retriever:
search_kwargs={"k": 6} # Retrieve more candidates
Problem: RateLimitError: You exceeded your current quota
Cause: OpenAI API quota exceeded
Fix: Check usage at platform.openai.com/usage or add payment method. For development, reduce number of test documents.
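An exhausted quota will not recover by retrying, but transient rate limits (too many requests in a short window) often do. A common pattern for those is exponential backoff with jitter, sketched below in plain Python. The call_with_backoff helper and its parameters are assumptions for illustration; in real use you would narrow retry_on to the specific rate-limit exception raised by your client library.

```python
import random
import time


def call_with_backoff(func, max_attempts=5, base_delay=1.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call func(), retrying with exponentially growing delays when it raises retry_on."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            delay = base_delay * 2 ** (attempt - 1)  # 1x, 2x, 4x, ... the base delay
            # Jitter spreads out retries so parallel clients do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 2))


# Example: a function that fails twice before succeeding.
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

print(call_with_backoff(flaky_call, base_delay=0.01))
```

Wrapping rag.query(question) in a lambda and passing it as func would apply the same retry policy to the pipeline.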
Problem: Answers are too verbose or off-topic
Cause: Prompt template or temperature settings
Fix: Adjust prompt in rag_pipeline.py to be more specific:
template="""Answer in 2-3 sentences maximum. Use only the context provided.
Context: {context}
Question: {question}
Brief answer:"""
And lower temperature further:
temperature=0.0 # Maximum determinism
Next Steps
Now that you have a working RAG pipeline, here are ways to extend it:
Add conversation memory: Implement chat history to enable follow-up questions
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(memory_key="chat_history")
Support more document types: Add loaders for DOCX, TXT, Markdown
from langchain_community.document_loaders import (
    TextLoader,
    UnstructuredWordDocumentLoader
)
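Diversify retrieval results: Use Maximum Marginal Relevance (MMR) to reduce near-duplicate chunks in the results. (True hybrid search, which combines vector similarity with keyword search, needs a separate keyword or sparse retriever.)
retriever=self.vector_store.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance for diversity
    search_kwargs={"k": 4, "fetch_k": 20}
)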
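To make the mmr search type less of a black box, here is a small pure-Python sketch of Maximum Marginal Relevance selection. It is a simplified illustration rather than LangChain's or Pinecone's code: mmr_select and lambda_mult are names chosen here, and the toy 2-dimensional vectors stand in for real embeddings.

```python
import math


def _cos(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr_select(query_vec, candidates, k=4, lambda_mult=0.5):
    """Pick up to k candidate indices, trading off relevance against redundancy."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = _cos(query_vec, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate
            redundancy = max((_cos(candidates[i], candidates[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [1.0, -0.5]]  # docs 0 and 1 are near-duplicates
print(mmr_select(query, docs, k=2))                   # → [0, 2]
print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # → [0, 1]
```

With lambda_mult=1.0 the selection reduces to plain similarity ranking, so the near-duplicate is returned; at 0.5 the more distinct third document wins the second slot. In the retriever above, fetch_k controls how many candidates are fetched before this re-ranking picks the final k.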
Add streaming responses: Stream LLM output token-by-token
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)
Deploy as an API: Wrap in FastAPI for production use
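from fastapi import FastAPI
from rag_pipeline import RAGPipeline
app = FastAPI()
rag = RAGPipeline()
@app.post("/query")
def query(question: str):
    # Plain def so FastAPI runs the blocking RAG call in a worker thread
    return rag.query(question)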
Challenge: Extend this system to handle a corpus of 1000+ documents. You’ll need to implement batch processing for embeddings and consider Pinecone’s namespaces feature to organize documents by category. Can you keep query latency under 2 seconds?
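As a starting point for the batching part of the challenge, chunks can be split into fixed-size groups before embedding. This is a minimal sketch; the batched helper and the batch size of 100 are assumptions, and the commented usage lines describe one possible way to feed batches to the vector store rather than a tested recipe.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int = 100) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Possible usage with the objects from this tutorial (assumption, not run here):
#   for batch in batched(chunks, 100):
#       vector_store.add_documents(batch)
# Namespaces could then be used to keep each document category in its own partition.
for batch in batched(list(range(5)), batch_size=2):
    print(batch)
```

Smaller batches keep each request modest and make it easier to resume after a failure, at the cost of more round trips.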