How to build a RAG pipeline with LangChain in Python

March 24, 2026 by Rohit Shukla

Setting up a RAG pipeline with LangChain and a vector database takes little code, but the tuning decisions determine whether the pipeline produces useful answers. LangChain has been the dominant Python framework for RAG since 2023 and remains the most popular choice in 2026, mostly because its abstractions for document loading, chunking, embeddings, and retrieval cover enough common work that you don’t have to reinvent them. A working RAG pipeline with LangChain takes about 50 lines of Python.

I’ve shipped LangChain RAG pipelines for a handful of projects over the past year – internal docs Q&A, customer support knowledge bases, code-aware research tools. The first one took me a week because I didn’t yet understand which knobs mattered. The most recent took an afternoon because the structure is the same in nearly every project. What follows is the working tutorial: the LangChain RAG pipeline architecture, the complete code with explanations, the tuning decisions that actually affect quality, and the production gotchas worth knowing.

Quick answer: building a LangChain RAG pipeline

A LangChain RAG pipeline has six pieces: a document loader (PDFs, web pages, etc.), a text splitter for chunking, an embedding model (OpenAI, Cohere, or HuggingFace), a vector database (Chroma for local, Pinecone or Weaviate for production), a retriever, and an LCEL chain combining retrieval with an LLM. The full implementation takes about 50 lines of Python. Most production complexity comes from tuning chunk size, picking the right embedding model, and choosing the vector database that fits your scale and budget.

The LangChain RAG pipeline architecture

Before the code, the pieces and how they connect. A RAG pipeline runs in two phases. The ingestion phase happens once when you add or update documents. The query phase runs every time a user asks a question.

In LangChain terms, the ingestion phase uses document loaders to read source content, text splitters to break it into chunks of manageable size, an embedding model to convert each chunk into a vector, and a vector store to persist those vectors with metadata.

The query phase uses a retriever wrapped around the vector store to find the most relevant chunks for an incoming question, a prompt template to construct the LLM input, an LLM to generate the response, and an LCEL chain (LangChain Expression Language) that wires all of these together with the | operator.

The whole pipeline lives between the user’s question and the LLM’s answer. LangChain’s job is to make each piece swappable – you can change the embedding model, swap the vector store, or replace the LLM without rewriting the rest of the code.
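
To make the two phases concrete, here is a dependency-free toy sketch. The bag-of-words embed function, the in-memory index, and the function names are illustrative assumptions standing in for a real embedding model and vector store; none of them are LangChain APIs.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real pipeline uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ingest(chunks: list[str]) -> list[tuple[Counter, str]]:
    """Ingestion phase: runs once when documents are added or updated."""
    return [(embed(chunk), chunk) for chunk in chunks]


def retrieve(index, question: str, k: int = 2) -> list[str]:
    """Query phase: runs for every question."""
    query_vector = embed(question)
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]


index = ingest([
    "Chroma stores vectors on local disk.",
    "The retriever returns the top matching chunks.",
    "Invoices are due within thirty days.",
])
top = retrieve(index, "Where does Chroma keep vectors?", k=1)
print(top)
```

The point of the split is that ingest is the expensive step and runs rarely, while retrieve is cheap and runs per question against the stored vectors.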

Install dependencies

The minimum dependencies for a LangChain RAG pipeline:

pip install langchain langchain-openai langchain-community langchain-chroma pypdf

Set your API key:

export OPENAI_API_KEY="your-key-here"

This setup uses OpenAI for embeddings and the LLM, and Chroma as the local vector store. Swapping any of these is a one-line change once the pipeline is working.

The complete LangChain RAG pipeline

Here’s the full working pipeline. The code below loads a PDF, chunks it, embeds the chunks, stores them in Chroma, and answers questions against the stored content.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# 1. Load documents
loader = PyPDFLoader("document.pdf")
docs = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)

# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./rag_db",
)

# 4. Build retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 5. Construct prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the provided context. If the context doesn't contain the answer, say so."),
    ("user", "Context:\n{context}\n\nQuestion: {question}"),
])

# 6. Build the LCEL chain
llm = ChatOpenAI(model="gpt-4o")

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# 7. Query the pipeline
answer = rag_chain.invoke("What is the main topic of this document?")
print(answer)

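That’s the whole pipeline. Each numbered section maps to one piece of the architecture above. Run it once to ingest the document, then call rag_chain.invoke() for each question. Note that, as written, the script calls Chroma.from_documents() on every run, so rerunning it embeds the document again and adds the same chunks to the store a second time. The persist_directory setting only saves the vectors to disk; to skip ingestion on later runs, open the existing store with Chroma(persist_directory="./rag_db", embedding_function=embeddings) instead of calling from_documents() again.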
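
One way to make reruns skip ingestion is to check the persistence directory before building the store. The sketch below assumes a non-empty directory means a usable store, which is a crude heuristic; needs_ingestion, load_or_build_store, and the load_chunks callable are hypothetical names introduced for illustration.

```python
from pathlib import Path


def needs_ingestion(persist_dir: str) -> bool:
    """True when the directory is missing or empty (hypothetical helper)."""
    path = Path(persist_dir)
    return not path.is_dir() or not any(path.iterdir())


def load_or_build_store(persist_dir, embeddings, load_chunks):
    """Open the persisted Chroma store, or ingest if none exists yet.

    load_chunks is a zero-argument callable (an assumption of this sketch)
    so that loading and splitting only happen when ingestion is needed.
    """
    from langchain_chroma import Chroma  # imported lazily; needs langchain-chroma

    if needs_ingestion(persist_dir):
        return Chroma.from_documents(
            documents=load_chunks(),
            embedding=embeddings,
            persist_directory=persist_dir,
        )
    # Reopen the existing collection without re-embedding anything.
    return Chroma(persist_directory=persist_dir, embedding_function=embeddings)
```

In the script above, this could replace step 3 with something like load_or_build_store("./rag_db", embeddings, lambda: splitter.split_documents(loader.load())). A failed or partial ingestion would still leave a non-empty directory, so a production version likely needs a stronger check.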

A few notes on what’s happening. The RecursiveCharacterTextSplitter tries to split on natural boundaries first (paragraph breaks, then line breaks, then spaces) and only falls back to a hard cut when no separator fits, which produces more coherent chunks than cutting at fixed character counts. Its chunk_size and chunk_overlap are measured in characters by default, not tokens. The k=5 parameter on the retriever returns the top 5 chunks per query – a reasonable starting point. The LCEL chain wires the pieces together with the | operator: the retriever’s output passes through format_docs into the prompt, then to the LLM, then to the output parser.
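
The | composition can feel opaque at first. The toy Step class below is a simplified illustration of the idea, not LangChain’s implementation: each step wraps a function, | chains two steps, and a dict fans the same input out to several branches, which mirrors how the dict at the start of rag_chain feeds both context and question. All names here are assumptions made for the sketch.

```python
class Step:
    """Minimal stand-in for a runnable: wraps a function and supports `|`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        next_step = coerce(other)
        return Step(lambda value: next_step.invoke(self.invoke(value)))

    def __ror__(self, other):
        # Lets a plain dict or function appear on the left of `|`.
        return coerce(other) | self


def coerce(obj):
    """Turn dicts and plain callables into Steps."""
    if isinstance(obj, Step):
        return obj
    if isinstance(obj, dict):
        branches = {key: coerce(val) for key, val in obj.items()}
        # Every branch receives the same input value.
        return Step(lambda value: {k: b.invoke(value) for k, b in branches.items()})
    if callable(obj):
        return Step(obj)
    raise TypeError(f"cannot compose {type(obj).__name__}")


def format_docs(docs):
    return "\n\n".join(docs)


retriever = Step(lambda question: [f"chunk about {question}"])  # fake retriever
passthrough = Step(lambda value: value)
prompt = Step(lambda d: f"Context: {d['context']} | Question: {d['question']}")

chain = {"context": retriever | format_docs, "question": passthrough} | prompt
result = chain.invoke("chunking")
print(result)
```

The question string enters once, the dict sends it down both branches, and the prompt step receives a dict with both keys filled in.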

Choosing chunk size and embedding model

Two tuning decisions affect RAG quality more than people expect: chunk size and embedding model. Getting these wrong produces a pipeline that runs but doesn’t answer well.

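Chunk size trades off between context completeness and retrieval precision. Small chunks (200-500 tokens) match queries precisely but lose surrounding context that might be needed for good answers. Large chunks (2000+ tokens) preserve context but reduce retrieval precision because more irrelevant content gets pulled in. A 1000-token chunk with 200-token overlap is a reasonable starting point for most use cases. Be aware that the code above sets chunk_size=1000 and chunk_overlap=200 on a splitter that counts characters by default, so those chunks are roughly 250 tokens of English text by the common four-characters-per-token rule of thumb. To size chunks in tokens, build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(), which requires the tiktoken package. For dense technical content, go shorter (500-800 tokens). For narrative content where context matters more, go longer (1500-2000 tokens).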
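
To see how chunk size and overlap interact, here is a deliberately simplified chunker that cuts fixed character windows. It is an assumption-laden sketch: unlike RecursiveCharacterTextSplitter it ignores paragraph and word boundaries, and chunk_text is a hypothetical name.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "".join(str(i % 10) for i in range(25))  # 25 characters of digits
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(repr(chunk))
```

Each window advances by chunk_size minus chunk_overlap, so a larger overlap means more chunks and more embedding cost for the same text.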

Embedding model choice matters because the model determines what “similar” means for your content. OpenAI’s text-embedding-3-small is a strong default – cheap, fast, and good enough for most use cases. text-embedding-3-large is more accurate for nuanced semantic similarity but costs more. For domain-specific content (legal, medical, technical), specialized embedding models from HuggingFace sometimes outperform general-purpose options. Cohere and Voyage AI offer competitive alternatives if OpenAI isn’t an option.

The mistake to avoid: mixing embedding models. Once you’ve embedded your documents with one model, queries must use the same model. Changing the model means re-embedding everything in your store.
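
A cheap guard against mixing models is to record the embedding model name next to the store and check it before querying. The manifest file name and both helper functions below are hypothetical; this is a sketch of the idea, not a LangChain feature.

```python
import json
import tempfile
from pathlib import Path

MANIFEST_NAME = "embedding_manifest.json"  # hypothetical file name


def record_embedding_model(store_dir: str, model_name: str) -> None:
    """Write the model name used at ingestion time next to the store."""
    path = Path(store_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / MANIFEST_NAME).write_text(json.dumps({"embedding_model": model_name}))


def check_embedding_model(store_dir: str, model_name: str) -> None:
    """Raise if the query-time model differs from the ingestion-time model."""
    manifest = json.loads((Path(store_dir) / MANIFEST_NAME).read_text())
    stored = manifest["embedding_model"]
    if stored != model_name:
        raise ValueError(
            f"store was built with {stored!r} but queries use {model_name!r}; "
            "re-embed the documents or switch the query model back"
        )


with tempfile.TemporaryDirectory() as store_dir:
    record_embedding_model(store_dir, "text-embedding-3-small")
    check_embedding_model(store_dir, "text-embedding-3-small")  # passes silently
    try:
        check_embedding_model(store_dir, "text-embedding-3-large")
        mismatch_detected = False
    except ValueError:
        mismatch_detected = True
print(mismatch_detected)
```

Failing loudly at startup is easier to debug than retrieval that quietly returns irrelevant chunks.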

Choosing a vector database

Chroma is the right pick for development and prototyping because it runs embedded in your Python process with no separate infrastructure. The code above uses Chroma with persistence to disk, which is enough for early-stage projects and projects with modest vector counts.

For production deployments, the credible options break into three groups. Managed services like Pinecone, Weaviate Cloud, and Qdrant Cloud handle the infrastructure for you at the cost of recurring fees. Self-hosted open-source options like Qdrant, Weaviate, and Milvus give you full control if you can operate them. Postgres-based options like pgvector are the right fit when you already operate Postgres and want to avoid adding a new system.

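Swapping the vector store in the code above is a small change. Replace Chroma.from_documents(...) with the equivalent call for your chosen store, such as PineconeVectorStore.from_documents(...) from the langchain-pinecone package, and the rest of the pipeline keeps working unchanged. You also need to install that store’s integration package, update the import, and supply any connection settings it requires (for Pinecone, an API key and an index name). This is the main benefit of LangChain’s abstraction layer – the vector database becomes a swappable component rather than a permanent architecture decision.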
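
One way to keep the swap in a single place is a small factory function. This is a sketch under assumptions: build_vectorstore and its kind and options parameters are hypothetical, the Pinecone branch assumes the langchain-pinecone package, a PINECONE_API_KEY environment variable, and an existing index with the right dimension, and neither branch has been tuned for production.

```python
def build_vectorstore(kind: str, chunks, embeddings, **options):
    """Return a LangChain vector store for the chosen backend (sketch)."""
    if kind == "chroma":
        from langchain_chroma import Chroma  # needs langchain-chroma

        return Chroma.from_documents(
            documents=chunks,
            embedding=embeddings,
            persist_directory=options.get("persist_directory", "./rag_db"),
        )
    if kind == "pinecone":
        # Needs langchain-pinecone, PINECONE_API_KEY in the environment,
        # and an index that already exists.
        from langchain_pinecone import PineconeVectorStore

        return PineconeVectorStore.from_documents(
            documents=chunks,
            embedding=embeddings,
            index_name=options["index_name"],
        )
    raise ValueError(f"unknown vector store kind: {kind!r}")
```

The rest of the pipeline only calls as_retriever() on whatever comes back, so the retriever, prompt, and chain stay the same for either backend.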

For most projects, the realistic path is starting with Chroma during development, then migrating to a production vector store once the pipeline is working end-to-end. The migration is usually faster than expected because LangChain’s interfaces hide most of the database-specific differences.

Common LangChain RAG pipeline gotchas

A few production-relevant pitfalls show up consistently in real deployments.

Chunk boundaries that split key context. The default text splitter respects paragraphs and sentences but can still cut a critical sentence in half if the chunk size limit hits mid-paragraph. Review chunked output during development to make sure important content stays together.

Top-K too small or too large. The k=5 setting in the code above returns five chunks per query (if you omit search_kwargs, LangChain’s vector store retrievers default to k=4). Some queries need fewer (precise factual lookups), others need more (synthesis across many documents). Tune k against your actual eval set rather than accepting a default.
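
Tuning k against an eval set can be as simple as measuring how often the chunk that holds the answer shows up in the top k results. The sketch below uses made-up chunk ids and rankings purely for illustration; hit_rate_at_k is a hypothetical helper, and in practice the rankings would come from your retriever.

```python
def hit_rate_at_k(ranked_ids_per_query, relevant_id_per_query, k):
    """Fraction of queries whose relevant chunk appears in the top k results."""
    hits = sum(
        relevant in ranked[:k]
        for ranked, relevant in zip(ranked_ids_per_query, relevant_id_per_query)
    )
    return hits / len(relevant_id_per_query)


# Hypothetical eval data: the ranked chunk ids returned for each question,
# and the id of the chunk that actually contains the answer.
ranked_ids = [
    ["c3", "c7", "c1", "c9", "c2"],
    ["c4", "c8", "c5", "c6", "c0"],
    ["c2", "c1", "c3", "c7", "c9"],
    ["c9", "c0", "c6", "c5", "c4"],
]
relevant_ids = ["c3", "c5", "c7", "c1"]

scores = {k: hit_rate_at_k(ranked_ids, relevant_ids, k) for k in (1, 3, 5)}
print(scores)
```

If the hit rate stops improving past some k, extra chunks mostly add noise and token cost to the prompt.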

Embedding model drift. If you upgrade the embedding model later, you must re-embed every document. The vectors from different models occupy different semantic spaces and don’t mix.

Prompt template assumptions about source citations. If you want the LLM to cite sources, the prompt template needs to instruct it to do so and the chunk metadata needs to include source information. The default flow doesn’t include citations by design.
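
A minimal way to support citations is to put source labels into the context string and tell the LLM to cite the bracketed numbers. The sketch below uses a stand-in Doc dataclass with the same page_content and metadata attributes as a LangChain Document; the source and page metadata keys are assumptions, so check what your loader actually populates.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in with the same two attributes as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs_with_sources(docs) -> str:
    """Prefix each chunk with a numbered source tag the LLM can cite."""
    blocks = []
    for number, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        label = f"{source}, page {page}" if page is not None else source
        blocks.append(f"[{number}] ({label})\n{doc.page_content}")
    return "\n\n".join(blocks)


docs = [
    Doc("Refunds are issued within 14 days.", {"source": "policy.pdf", "page": 3}),
    Doc("Contact support by email.", {"source": "faq.html"}),
]
context = format_docs_with_sources(docs)
print(context)
```

Used in place of format_docs in the chain, this pairs with a system prompt line such as "Cite the bracketed source numbers you relied on."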

Hallucinations on missing-context queries. Without explicit prompt instruction to say “I don’t know,” LLMs fabricate answers when retrieval misses the relevant chunks. The prompt template in the code above includes this instruction, but it’s worth verifying the LLM actually follows it for your specific model and content.
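
One rough way to verify this is to ask questions the documents cannot answer and count how often the pipeline declines. The refusal phrasings, helper names, and fake pipeline below are all assumptions for illustration; substring matching is a blunt check, and a real eval would likely need human review or an LLM grader.

```python
# Assumed refusal phrasings; adjust to how your model actually words them.
REFUSAL_MARKERS = (
    "don't know",
    "doesn't contain",
    "does not contain",
    "not in the context",
)


def looks_like_refusal(answer: str) -> bool:
    """Crude check: does the answer contain a known refusal phrase?"""
    text = answer.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(ask, unanswerable_questions) -> float:
    """Share of out-of-scope questions that the pipeline declines to answer."""
    refusals = sum(looks_like_refusal(ask(q)) for q in unanswerable_questions)
    return refusals / len(unanswerable_questions)


def fake_ask(question: str) -> str:
    """Fake pipeline for demonstration; in practice pass rag_chain.invoke."""
    if "weather" in question:
        return "The context doesn't contain that information."
    return "The CEO is Jane Doe."  # a fabricated answer


rate = refusal_rate(fake_ask, ["What is the weather on Mars?", "Who is the CEO?"])
print(rate)
```

A rate well below 1.0 on questions you know are out of scope suggests the model is filling gaps rather than admitting them.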

FAQ

How do I set up a RAG pipeline with LangChain?

To set up a RAG pipeline with LangChain, install langchain, langchain-openai, langchain-community, and a vector database integration like langchain-chroma. The pipeline has six pieces: document loader, text splitter, embedding model, vector store, retriever, and LCEL chain combining retrieval with an LLM. The complete code runs about 50 lines of Python. Most tuning happens in chunk size (start with 1000 tokens, 200 overlap), top-K retrieval (start with k=5), and embedding model choice (text-embedding-3-small is a strong default).

What vector database should I use with LangChain?

For development and prototyping, Chroma runs embedded in your Python process with no separate infrastructure – the easiest starting point. For production, Pinecone is the most popular managed service, Qdrant offers performance-focused self-hosting or managed cloud, Weaviate provides the richest feature set, and pgvector works when you already operate Postgres. LangChain’s abstractions make swapping vector stores a one-line change, so starting with Chroma during development and migrating to a production store later is a common and low-friction path.

What chunk size should I use for LangChain RAG?

The right chunk size depends on content type. For most use cases, 1000 tokens with 200-token overlap is a strong starting point. For dense technical content, 500-800 tokens preserves more retrieval precision. For narrative content where context matters across paragraphs, 1500-2000 tokens with larger overlap works better. The RecursiveCharacterTextSplitter prefers natural boundaries (paragraph breaks, then line breaks, then spaces) over cutting mid-word, which produces better chunks than fixed-character splitters. Keep in mind that its chunk_size counts characters by default, so use RecursiveCharacterTextSplitter.from_tiktoken_encoder() if you want to size chunks in tokens. Always review the chunked output during development to verify key context stays together.

What’s the best embedding model for LangChain RAG?

OpenAI’s text-embedding-3-small is the best default embedding model for most LangChain RAG pipelines – cheap, fast, and accurate enough for general use cases. text-embedding-3-large is more accurate for nuanced semantic similarity but costs more. For domain-specific content (legal, medical, scientific), specialized embedding models from HuggingFace sometimes outperform general-purpose options. Cohere and Voyage AI offer competitive alternatives. The critical constraint: once you’ve embedded documents with one model, all queries must use the same model. Switching models requires re-embedding the entire store.

Should I use LangChain or build RAG from scratch?

Use LangChain when you want the document loaders, text splitters, retriever abstractions, and LCEL composition syntax to save time over rolling your own. The framework genuinely saves work on the integration plumbing. Build from scratch when you want full control over each step or when LangChain’s abstractions are getting in your way. For most teams shipping production RAG in 2026, LangChain remains the productive default. The fallback is rolling your own using just the OpenAI SDK and a vector database client, which works but reinvents most of what LangChain already provides.

If you’ve shipped a LangChain RAG pipeline in production and have honest numbers on what tuning actually moved your eval metrics, that writeup is the gap worth filling. Most RAG content covers the basic setup. Real reports on what changed quality in production are scarce.
