Setting up a RAG pipeline with LangChain and a vector database takes very little code, but the tuning decisions determine whether the pipeline produces useful answers. LangChain has been the dominant Python framework for RAG since 2023 and remains the most popular choice in 2026, mostly because its abstractions for document loading, chunking, embeddings, and retrieval cover enough common work that you don’t have to reinvent them. A working RAG pipeline with LangChain takes about 50 lines of Python.
I’ve shipped LangChain RAG pipelines for a handful of projects over the past year – internal docs Q&A, customer support knowledge bases, code-aware research tools. The first one took me a week because I didn’t yet understand which knobs mattered. The most recent took an afternoon because the structure is the same in nearly every project. What follows is the working tutorial: the LangChain RAG pipeline architecture, the complete code with explanations, the tuning decisions that actually affect quality, and the production gotchas worth knowing.
Quick answer: building a LangChain RAG pipeline
A LangChain RAG pipeline has six pieces: a document loader (PDFs, web pages, etc.), a text splitter for chunking, an embedding model (OpenAI, Cohere, or HuggingFace), a vector database (Chroma for local, Pinecone or Weaviate for production), a retriever, and an LCEL chain combining retrieval with an LLM. The full implementation takes about 50 lines of Python. Most production complexity comes from tuning chunk size, picking the right embedding model, and choosing the vector database that fits your scale and budget.
The LangChain RAG pipeline architecture
Before the code, the pieces and how they connect. A RAG pipeline runs in two phases. The ingestion phase happens once when you add or update documents. The query phase runs every time a user asks a question.
In LangChain terms, the ingestion phase uses document loaders to read source content, text splitters to break it into chunks of manageable size, an embedding model to convert each chunk into a vector, and a vector store to persist those vectors with metadata.
The query phase uses a retriever wrapped around the vector store to find the most relevant chunks for an incoming question, a prompt template to construct the LLM input, an LLM to generate the response, and an LCEL chain (LangChain Expression Language) that wires all of these together with the | operator.
The whole pipeline lives between the user’s question and the LLM’s answer. LangChain’s job is to make each piece swappable – you can change the embedding model, swap the vector store, or replace the LLM without rewriting the rest of the code.
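To make the swappability concrete, here is a toy sketch of how composition with | can work. The Step class and the stage functions are made up for illustration; they are not LangChain's classes, although LangChain's runnables expose | and .invoke() in a similar way.

```python
class Step:
    """Toy stand-in for a pipeline stage that supports `|` and `.invoke()`."""

    def __init__(self, fn):
        self.fn = fn

    def invoke(self, value):
        return self.fn(value)

    def __or__(self, other):
        # `a | b` builds a new step that runs a, then feeds its output to b.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stages; real ones would call a vector store and an LLM.
retrieve = Step(lambda question: [f"chunk about {question}"])
build_prompt = Step(lambda chunks: "Context: " + " ".join(chunks))
shouting_llm = Step(lambda prompt: prompt.upper())
quiet_llm = Step(lambda prompt: prompt.lower())

chain = retrieve | build_prompt | shouting_llm
swapped = retrieve | build_prompt | quiet_llm  # only the last stage changed

print(chain.invoke("pricing"))    # CONTEXT: CHUNK ABOUT PRICING
print(swapped.invoke("pricing"))  # context: chunk about pricing
```

Because every stage exposes the same interface, replacing one stage leaves the others untouched. That is the property the real pipeline relies on.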
Install dependencies
The minimum dependencies for a LangChain RAG pipeline:
pip install langchain langchain-openai langchain-community langchain-chroma langchain-text-splitters pypdf
Set your API key:
export OPENAI_API_KEY="your-key-here"
This setup uses OpenAI for embeddings and the LLM, and Chroma as the local vector store. Swapping any of these is a small, localized change once the pipeline is working: a different package, import, and constructor call.
The complete LangChain RAG pipeline
Here’s the full working pipeline. The code below loads a PDF, chunks it, embeds the chunks, stores them in Chroma, and answers questions against the stored content.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
# 1. Load documents
loader = PyPDFLoader("document.pdf")
docs = loader.load()
# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
chunks = splitter.split_documents(docs)
# 3. Embed and store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./rag_db",
)
# 4. Build retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
# 5. Construct prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the provided context. If the context doesn't contain the answer, say so."),
    ("user", "Context:\n{context}\n\nQuestion: {question}"),
])
# 6. Build the LCEL chain
llm = ChatOpenAI(model="gpt-4o")
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
# 7. Query the pipeline
answer = rag_chain.invoke("What is the main topic of this document?")
print(answer)
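That’s the whole pipeline. Each numbered section maps to one piece of the architecture above. Run it once to ingest the document, then call rag_chain.invoke() for each question. One caveat: the persistence directory keeps the vectors on disk, but rerunning the script as written calls Chroma.from_documents() again, which re-embeds the PDF and adds the same chunks to ./rag_db a second time. To skip ingestion on later runs, open the existing store with Chroma(persist_directory="./rag_db", embedding_function=embeddings) instead of rebuilding it.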
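Since Chroma.from_documents() embeds and writes whatever it is given every time it runs, a script that should skip ingestion on later runs needs to open the existing directory instead. A minimal sketch, assuming a non-empty persist directory means the store was already built. The helper names has_persisted_store, get_vectorstore, and load_chunks are hypothetical, and the directory check is a heuristic rather than anything Chroma guarantees.

```python
from pathlib import Path

PERSIST_DIR = "./rag_db"  # same directory as in the pipeline above


def has_persisted_store(persist_dir: str) -> bool:
    """Heuristic: treat a non-empty directory as an existing Chroma store."""
    path = Path(persist_dir)
    return path.is_dir() and any(path.iterdir())


def get_vectorstore(embeddings, load_chunks, persist_dir: str = PERSIST_DIR):
    """Open the existing store if present; otherwise ingest once.

    `load_chunks` is a hypothetical zero-argument callable that returns the
    split documents (steps 1 and 2 of the pipeline).
    """
    # Imported lazily so has_persisted_store() works without Chroma installed.
    from langchain_chroma import Chroma

    if has_persisted_store(persist_dir):
        # Query-only path: nothing is re-embedded.
        return Chroma(persist_directory=persist_dir, embedding_function=embeddings)
    # First run: embed the chunks and write them to disk.
    return Chroma.from_documents(
        documents=load_chunks(),
        embedding=embeddings,
        persist_directory=persist_dir,
    )
```

In the pipeline, step 3 would become vectorstore = get_vectorstore(embeddings, lambda: splitter.split_documents(loader.load())), so the PDF is only read and embedded when the store does not exist yet.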
A few notes on what’s happening. The RecursiveCharacterTextSplitter tries natural boundaries first (paragraph breaks, then line breaks, then spaces) and only cuts mid-word when a piece is still too long, which produces more coherent chunks than cutting at fixed character counts. The k=5 parameter on the retriever returns the top 5 chunks per query, a reasonable starting point. The LCEL chain sends the question to the retriever, formats the retrieved chunks into the context string, and pipes the result through the prompt, the LLM, and the output parser using the | operator, which is LangChain’s expression language for composition.
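The recursive idea behind the splitter can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores chunk_overlap and merges pieces less aggressively, and the function name recursive_split is made up for the example.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse on pieces still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, finer = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut at the size limit.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry with a finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Para one is short.\n\n"
    "Para two is a bit longer than the first one.\n\n"
    "End."
)
for chunk in recursive_split(sample, chunk_size=30):
    print(repr(chunk))
```

The short first paragraph survives whole, while the second one is too long for a 30-character limit and gets split at a space instead of mid-word.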
Choosing chunk size and embedding model
Two tuning decisions affect RAG quality more than people expect: chunk size and embedding model. Getting these wrong produces a pipeline that runs but doesn’t answer well.
Chunk size trades off between context completeness and retrieval precision. Small chunks (200-500 tokens) match queries precisely but lose surrounding context that might be needed for good answers. Large chunks (2000+ tokens) preserve context but reduce retrieval precision because more irrelevant content gets pulled in. Around 1000 tokens with 200 tokens of overlap is a reasonable starting point for most use cases. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, so the chunk_size=1000 in the code above yields chunks of very roughly 250 tokens of English text; to size chunks in tokens, build the splitter with RecursiveCharacterTextSplitter.from_tiktoken_encoder(...) instead (this requires the tiktoken package). For dense technical content, go shorter (500-800 tokens). For narrative content where context matters more, go longer (1500-2000 tokens).
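Overlap also has a cost that is easy to estimate up front: more overlap means more chunks to embed and store. The sketch below assumes a plain sliding window over the document. The real splitter breaks at separators, so actual counts will differ somewhat, and estimate_chunk_count is a made-up helper name.

```python
import math


def estimate_chunk_count(doc_len, chunk_size, chunk_overlap):
    """Approximate number of sliding-window chunks for a document.

    All three arguments use the same unit (characters or tokens).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if doc_len <= 0:
        return 0
    if doc_len <= chunk_size:
        return 1
    stride = chunk_size - chunk_overlap  # how far each new chunk advances
    return 1 + math.ceil((doc_len - chunk_size) / stride)


print(estimate_chunk_count(100_000, 1000, 0))    # 100
print(estimate_chunk_count(100_000, 1000, 200))  # 125
```

Under this approximation, a 200-unit overlap on 1000-unit chunks means about 25% more chunks, and therefore about 25% more embedding calls and stored vectors, than no overlap.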
Embedding model choice matters because the model determines what “similar” means for your content. OpenAI’s text-embedding-3-small is a strong default – cheap, fast, and good enough for most use cases. text-embedding-3-large is more accurate for nuanced semantic similarity but costs more. For domain-specific content (legal, medical, technical), specialized embedding models from HuggingFace sometimes outperform general-purpose options. Cohere and Voyage AI offer competitive alternatives if OpenAI isn’t an option.
The mistake to avoid: mixing embedding models. Once you’ve embedded your documents with one model, queries must use the same model. Changing the model means re-embedding everything in your store.
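One way to guard against accidental mixing is to record the model name next to the store and check it on every run. This is a sketch of that idea, not a LangChain or Chroma feature; the function name and the sidecar file are assumptions made for illustration.

```python
import json
from pathlib import Path


def ensure_same_embedding_model(marker_path, model_name):
    """Record the embedding model on first use; fail loudly if it changes later.

    `marker_path` is a hypothetical sidecar file kept outside the vector
    store's own directory, for example "./rag_db.embedding_model.json".
    """
    marker = Path(marker_path)
    if marker.exists():
        stored = json.loads(marker.read_text())["model"]
        if stored != model_name:
            raise ValueError(
                f"Store was embedded with {stored!r} but {model_name!r} was "
                "requested; re-embed the documents or switch back."
            )
        return
    # First run: remember which model produced the stored vectors.
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.write_text(json.dumps({"model": model_name}))
```

Calling ensure_same_embedding_model("./rag_db.embedding_model.json", "text-embedding-3-small") before building the retriever turns a silent quality problem into an immediate error.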
Choosing a vector database
Chroma is the right pick for development and prototyping because it runs embedded in your Python process with no separate infrastructure. The code above uses Chroma with persistence to disk, which is enough for early-stage projects and projects with modest vector counts.
For production deployments, the credible options break into three groups. Managed services like Pinecone, Weaviate Cloud, and Qdrant Cloud handle the infrastructure for you at the cost of recurring fees. Self-hosted open-source options like Qdrant, Weaviate, and Milvus give you full control if you can operate them. Postgres-based options like pgvector are the right fit when you already operate Postgres and want to avoid adding a new system.
Swapping the vector store in the code above is a small change. Install the store’s integration package, then replace Chroma.from_documents(...) with the equivalent for your chosen store, for example PineconeVectorStore.from_documents(...) from langchain-pinecone, passing that store’s connection settings in place of persist_directory. The rest of the pipeline keeps working unchanged. This is the main benefit of LangChain’s abstraction layer: the vector database becomes a swappable component rather than a permanent architecture decision.
For most projects, the realistic path is starting with Chroma during development, then migrating to a production vector store once the pipeline is working end-to-end. The migration is usually faster than expected because LangChain’s interfaces hide most of the database-specific differences.
Common LangChain RAG pipeline gotchas
A few production-relevant pitfalls show up consistently in real deployments.
Chunk boundaries that split key context. The recursive text splitter prefers paragraph and line breaks but can still separate a sentence from the context it depends on, or cut it in half, when the chunk size limit falls mid-paragraph. Review chunked output during development to make sure important content stays together.
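A quick way to start that review is to flag chunks that do not end on sentence punctuation. This is a rough heuristic under simple assumptions (English prose, sentences ending in ., !, ? or :), and the helper name is made up; headings and list items will show up as false positives.

```python
SENTENCE_ENDINGS = (".", "!", "?", ":")
TRAILING_CLOSERS = "\"')]”’"  # quotes and brackets that may follow the punctuation


def chunks_ending_mid_sentence(texts, tail_len=40):
    """Return (index, tail) pairs for chunks that do not end on sentence punctuation."""
    flagged = []
    for i, text in enumerate(texts):
        stripped = text.rstrip().rstrip(TRAILING_CLOSERS)
        if stripped and not stripped.endswith(SENTENCE_ENDINGS):
            flagged.append((i, text.rstrip()[-tail_len:]))
    return flagged


samples = [
    "The refund window is 30 days.",
    "Customers must submit the form within",
    'She said "stop."',
]
print(chunks_ending_mid_sentence(samples))
```

With the pipeline above, the input would be [chunk.page_content for chunk in chunks], and the flagged tails show where to look first.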
Top-K too small or too large. The k=5 setting in the code above returns five chunks per query. Some queries need fewer (precise factual lookups), others need more (synthesis across many documents). Tune k against your actual eval set rather than keeping the starting value.
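Tuning k against an eval set can start as simply as measuring how often the expected source shows up in the top k results. The sketch below uses a hypothetical retrieve(question, k) callable and toy rankings in place of a real retriever; the metric name and data are illustrative.

```python
def hit_rate_at_k(retrieve, eval_set, k):
    """Fraction of questions whose expected source appears in the top-k results.

    `retrieve(question, k)` is a hypothetical callable returning a ranked list
    of source identifiers; `eval_set` is a list of (question, expected_source).
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = sum(
        expected in retrieve(question, k)
        for question, expected in eval_set
    )
    return hits / len(eval_set)


# Toy rankings standing in for a real retriever.
RANKINGS = {
    "refund window?": ["faq.pdf", "policy.pdf", "terms.pdf"],
    "api rate limit?": ["guide.pdf", "changelog.pdf", "api.pdf"],
}
EVAL_SET = [("refund window?", "policy.pdf"), ("api rate limit?", "api.pdf")]


def toy_retrieve(question, k):
    return RANKINGS[question][:k]


for k in (1, 2, 3):
    print(k, hit_rate_at_k(toy_retrieve, EVAL_SET, k))
```

For the pipeline above, retrieve could wrap vectorstore.similarity_search(question, k=k) and map each returned document to doc.metadata["source"]. The smallest k where the hit rate stops improving is a sensible candidate.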
Embedding model drift. If you upgrade the embedding model later, you must re-embed every document. The vectors from different models occupy different semantic spaces and don’t mix.
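Prompt template assumptions about source citations. If you want the LLM to cite sources, the prompt template needs to instruct it to do so and the context it receives needs to include source information. In the code above, format_docs passes only page_content to the prompt, so the source metadata attached to each chunk never reaches the LLM and the answers carry no citations.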
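A possible variant of format_docs that carries source information into the prompt is sketched below. It assumes each document has the source and zero-based page metadata keys that PyPDFLoader sets; other loaders may use different keys, and the numbered label format is just one option.

```python
def format_docs_with_sources(docs):
    """Join chunks into a context string, labelling each with its source."""
    blocks = []
    for i, doc in enumerate(docs, start=1):
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page")
        label = f"[{i}] {source}"
        if isinstance(page, int):
            label += f", page {page + 1}"  # convert zero-based page to 1-based
        blocks.append(f"{label}\n{doc.page_content}")
    return "\n\n".join(blocks)
```

To use it, swap it in for format_docs in the chain and extend the system prompt with an instruction such as "Cite the bracketed source numbers you used." Whether the model cites reliably still needs checking on your own content.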
Hallucinations on missing-context queries. Without explicit prompt instruction to say “I don’t know,” LLMs fabricate answers when retrieval misses the relevant chunks. The prompt template in the code above includes this instruction, but it’s worth verifying the LLM actually follows it for your specific model and content.
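One rough way to verify this is to ask questions you know the documents cannot answer and count how often the reply contains a refusal. The sketch below is a crude string check, not a proper evaluation; the marker phrases, function name, and canned answers are assumptions for illustration.

```python
REFUSAL_MARKERS = (
    "i don't know",
    "doesn't contain",
    "does not contain",
    "not in the context",
)


def refusal_rate(ask, unanswerable_questions, markers=REFUSAL_MARKERS):
    """Fraction of unanswerable questions where the answer contains a refusal.

    `ask` is any callable mapping a question string to an answer string.
    """
    if not unanswerable_questions:
        raise ValueError("no questions given")
    refused = 0
    for question in unanswerable_questions:
        # Normalise case and curly apostrophes before matching.
        answer = ask(question).lower().replace("’", "'")
        if any(marker in answer for marker in markers):
            refused += 1
    return refused / len(unanswerable_questions)


# Canned answers standing in for a real chain.
CANNED = {
    "Who won the 1950 World Cup?": "The context doesn’t contain that information.",
    "What is the CEO's shoe size?": "The CEO wears a size 10.",
}
print(refusal_rate(CANNED.get, list(CANNED)))  # 0.5
```

With the pipeline above, the call would be refusal_rate(rag_chain.invoke, questions). A rate well below 1.0 suggests the instruction is not being followed for that model and content.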
If you’ve shipped a LangChain RAG pipeline in production and have honest numbers on what tuning actually moved your eval metrics, that writeup is the gap worth filling. Most RAG content covers the basic setup. Real reports on what changed quality in production are scarce.