LangChain is the most widely used Python framework for building LLM-powered applications. It provides the abstractions, integrations, and tools needed to build everything from a simple Q&A chatbot to a complex multi-agent system that can research, analyze, and generate reports autonomously. As of 2025, LangChain has over 95,000 GitHub stars and is used by thousands of production applications.
This guide covers the core concepts you need to be productive with LangChain, with real code you can copy and run. We'll build up to a complete RAG (Retrieval Augmented Generation) system — the most practically valuable architecture for most LLM applications.
What Is LangChain?
LangChain is a framework that simplifies the development of LLM applications by providing:
Composable Chains
String together LLM calls, data transformations, and tool calls using LCEL (LangChain Expression Language) — a clean, readable pipe syntax.
Agent Framework
Build autonomous agents that can use tools (search, code execution, APIs) and reason step-by-step using the ReAct pattern.
300+ Integrations
Native connectors for every major LLM provider, vector store, document loader, and tool — so you're not writing boilerplate HTTP calls.
Installation & Setup
# Install core packages
pip install langchain langchain-openai langchain-community chromadb
# Extras used later in this guide: agents (LangGraph), PDF loading, .env support, web search tool
pip install langgraph pypdf python-dotenv duckduckgo-search
# Set your API key
import os
os.environ["OPENAI_API_KEY"] = "your-key-here"
# Or set it in a .env file and use: from dotenv import load_dotenv; load_dotenv()
The LangChain package ecosystem is split into modular sub-packages as of v0.2+: langchain-core (base abstractions such as prompts, messages, and runnables), langchain (chains and agents), langchain-openai (OpenAI-specific), langchain-anthropic (Claude), and langchain-community (community-maintained integrations). Install only what you need to keep your dependency footprint small.
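For example, a project that only talks to Claude can get by with just the provider package and the core abstractions it depends on. A minimal sketch; adjust to the providers you actually use:

# Minimal footprint for a Claude-only project
pip install langchain-core langchain-anthropic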
LCEL: LangChain Expression Language
LCEL is LangChain's modern chain-composition syntax built on Python's pipe (|) operator. It's the recommended way to build chains and produces cleaner code than the legacy chain classes (such as LLMChain).
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Initialize the model
llm = ChatOpenAI(model="gpt-4o", temperature=0.7)
# Create a prompt template
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant that explains technical concepts simply."),
("human", "Explain {concept} as if I'm 12 years old. Keep it under 100 words.")
])
# Build the chain using LCEL pipe syntax
chain = prompt | llm | StrOutputParser()
# Invoke the chain
result = chain.invoke({"concept": "neural networks"})
print(result)
The pipe operator creates a Runnable — a composable unit that can be invoked, batched, or streamed. Every component (prompt, model, parser) implements the same interface, making them trivially composable. You can also stream the output character by character:
# Streaming output (great for chat interfaces)
for chunk in chain.stream({"concept": "vector databases"}):
print(chunk, end="", flush=True)
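Because every Runnable exposes the same interface, the same chain can also process several inputs at once with .batch(), which runs them concurrently under the hood. A quick sketch reusing the chain defined above:

# Process multiple inputs in one call
results = chain.batch([
    {"concept": "neural networks"},
    {"concept": "vector databases"},
    {"concept": "embeddings"},
])
print(len(results))  # three independent answers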
Memory: Maintaining Conversation State
By default, LLMs are stateless — each call is independent. LangChain's memory modules add conversation history. The most common pattern in 2025 uses RunnableWithMessageHistory:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_core.output_parsers import StrOutputParser  # used by the chain below
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful coding assistant."),
MessagesPlaceholder(variable_name="history"), # injects conversation history
("human", "{input}")
])
chain = prompt | llm | StrOutputParser()
# Store histories per session
session_store = {}
def get_session_history(session_id: str):
if session_id not in session_store:
session_store[session_id] = ChatMessageHistory()
return session_store[session_id]
# Wrap chain with memory
chain_with_history = RunnableWithMessageHistory(
chain,
get_session_history,
input_messages_key="input",
history_messages_key="history"
)
# Use the same session_id to maintain context
config = {"configurable": {"session_id": "user-123"}}
response1 = chain_with_history.invoke({"input": "What is a decorator in Python?"}, config=config)
response2 = chain_with_history.invoke({"input": "Can you show me an example?"}, config=config)
# response2 understands "example" refers to the decorator from response1
For production, replace ChatMessageHistory (in-memory, lost on restart) with a persistent backend: RedisChatMessageHistory, DynamoDBChatMessageHistory, or PostgresChatMessageHistory.
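As a sketch of what that swap looks like (assuming a Redis instance reachable at localhost:6379), only the history factory changes; the rest of the chain stays the same:

# Persistent, multi-worker-safe history backed by Redis (requires: pip install redis)
from langchain_community.chat_message_histories import RedisChatMessageHistory

def get_session_history(session_id: str):
    return RedisChatMessageHistory(session_id=session_id, url="redis://localhost:6379/0")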
Building a RAG System
Retrieval Augmented Generation (RAG) is the most valuable LangChain pattern for real applications. Instead of relying solely on the LLM's training data, RAG fetches relevant documents from your own data store and includes them in the context before generation. This enables accurate, cited answers over private documents, up-to-date information, and knowledge bases too large for a single context window.
📥 Document Loading & Chunking
Load your documents and split them into chunks small enough to fit in context. The chunk size / overlap balance matters: larger chunks preserve more context, smaller chunks enable more precise retrieval.
from langchain_community.document_loaders import PyPDFLoader, TextLoader, WebBaseLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Load a PDF
loader = PyPDFLoader("company_handbook.pdf")
documents = loader.load()
# Load from URL
web_loader = WebBaseLoader("https://docs.example.com/api-reference")
web_docs = web_loader.load()
# Split into chunks
splitter = RecursiveCharacterTextSplitter(
chunk_size=1000, # characters per chunk
chunk_overlap=200, # overlap to preserve context across chunks
length_function=len,
separators=["\n\n", "\n", " ", ""] # split on paragraphs first, then lines, then words
)
chunks = splitter.split_documents(documents)
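Before indexing, it's worth a quick sanity check on the split: chunk counts and average size tell you whether chunk_size needs tuning. An illustrative snippet using the chunks produced above:

# Inspect the split before spending money on embeddings
print(f"{len(documents)} pages -> {len(chunks)} chunks")
avg_len = sum(len(c.page_content) for c in chunks) // len(chunks)
print(f"Average chunk length: {avg_len} characters")
print(chunks[0].metadata)             # source path and page number carried over from the loader
print(chunks[0].page_content[:200])   # eyeball the first chunk for sensible boundaries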
🗄️ Vector Store Indexing
Convert document chunks to embeddings (dense numerical vectors) and store them in a vector database. Embeddings capture semantic meaning — chunks about similar topics end up near each other in vector space, enabling semantic search.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
# Create embeddings using OpenAI's text-embedding-3-small (cost-effective)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Create and persist a Chroma vector store
vectorstore = Chroma.from_documents(
documents=chunks,
embedding=embeddings,
persist_directory="./chroma_db" # persist to disk
)
# For subsequent runs, load the existing store:
# vectorstore = Chroma(persist_directory="./chroma_db", embedding_function=embeddings)
# Create a retriever — finds the top-k most semantically similar chunks
retriever = vectorstore.as_retriever(
search_type="mmr", # Maximum Marginal Relevance: balances similarity + diversity
search_kwargs={"k": 5} # return 5 most relevant chunks
)
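To make "semantic similarity" concrete, you can embed a few strings yourself and compare them. This is purely an illustration (cosine similarity computed by hand, not something a real pipeline needs):

# Illustration: semantically related texts get nearby vectors
import math

v1 = embeddings.embed_query("How many vacation days do employees get?")
v2 = embeddings.embed_query("What is the annual leave allowance?")
v3 = embeddings.embed_query("How do I restart the staging server?")

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

print(cosine(v1, v2))  # high — both are about time off
print(cosine(v1, v3))  # noticeably lower — unrelated topic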
🔗 RAG Chain Assembly
Connect the retriever to the LLM using LCEL. The chain retrieves relevant chunks, formats them into a prompt with context, and generates an answer grounded in your documents.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
llm = ChatOpenAI(model="gpt-4o", temperature=0)
rag_prompt = ChatPromptTemplate.from_messages([
("system", """You are a helpful assistant. Answer questions based ONLY on the provided context.
If the answer isn't in the context, say 'I don't have information about that in the provided documents.'
Always cite which part of the context your answer comes from.
Context:
{context}"""),
("human", "{question}")
])
def format_docs(docs):
return "\n\n---\n\n".join(
f"Source: {doc.metadata.get('source', 'Unknown')}, Page: {doc.metadata.get('page', 'N/A')}\n{doc.page_content}"
for doc in docs
)
# Full RAG chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| rag_prompt
| llm
| StrOutputParser()
)
# Query your documents
answer = rag_chain.invoke("What is the company's remote work policy?")
print(answer)
📊 Evaluation & Improvement
A RAG system is only as good as its retrieval quality. Use LangSmith or a simple manual evaluation loop to measure retrieval precision and answer faithfulness — this catches issues like chunks being too small/large or the wrong embedding model being used.
# Simple retrieval quality test
test_questions = [
"What is the vacation policy?",
"How do I expense travel?",
"Who approves performance reviews?"
]
for question in test_questions:
retrieved_docs = retriever.invoke(question)
print(f"\nQuestion: {question}")
print(f"Top retrieved chunk: {retrieved_docs[0].page_content[:200]}...")
print(f"Total chunks retrieved: {len(retrieved_docs)}")
# Manually verify: does the top chunk actually contain the answer?
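Retrieval precision is only half the picture; you can also spot-check answer faithfulness with a simple LLM-as-judge pass. This is a rough sketch, not a standard LangChain API, and the grading prompt is an assumption you should tune:

# Rough faithfulness check: is the generated answer actually supported by the retrieved context?
grader_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a strict grader. Reply with exactly 'SUPPORTED' or 'UNSUPPORTED'."),
    ("human", "Context:\n{context}\n\nAnswer:\n{answer}\n\nIs every claim in the answer supported by the context?")
])
grader_chain = grader_prompt | llm | StrOutputParser()

question = "What is the vacation policy?"
docs = retriever.invoke(question)
answer = rag_chain.invoke(question)
verdict = grader_chain.invoke({"context": format_docs(docs), "answer": answer})
print(verdict)  # manually review anything marked UNSUPPORTED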
LangChain Agents
LangChain agents use an LLM to decide which tools to use and in what order, dynamically adapting to the task. The new recommended approach uses create_react_agent from LangGraph:
from langchain_openai import ChatOpenAI
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent
# Define tools
search = DuckDuckGoSearchRun()
@tool
def calculate(expression: str) -> str:
"""Evaluate a mathematical expression. Input should be a valid Python math expression."""
try:
        return str(eval(expression))  # fine for a demo; never eval untrusted input in production
except Exception as e:
return f"Error: {e}"
tools = [search, calculate]
# Create the agent
llm = ChatOpenAI(model="gpt-4o", temperature=0)
agent = create_react_agent(llm, tools)
# Run the agent
response = agent.invoke({
"messages": [("human", "What is the current price of NVIDIA stock, and what percentage change is that from $100?")]
})
print(response["messages"][-1].content)
The agent will autonomously call the search tool to find the current price, then call the calculate tool to compute the percentage — all without you specifying the order of operations.
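If you want to watch that reasoning unfold, the compiled agent can be streamed step by step. A sketch using LangGraph's stream_mode="values", which yields the full message list after every step:

# Print each intermediate message (tool calls and tool results) as the agent works
for state in agent.stream(
    {"messages": [("human", "What is 15% of 2048?")]},
    stream_mode="values",
):
    state["messages"][-1].pretty_print()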
LangChain vs LlamaIndex vs Haystack
| Feature | LangChain | LlamaIndex | Haystack |
|---|---|---|---|
| Primary strength | Agent workflows, chains, general-purpose | Data indexing, RAG | Enterprise search pipelines |
| RAG support | ✅ Excellent | ✅ Best-in-class | ✅ Good |
| Agent framework | ✅ Excellent (LangGraph) | ✅ Good | ⚠️ Limited |
| Observability | ✅ LangSmith | ✅ LlamaTrace | ⚠️ Tracing integrations (OpenTelemetry) |
| Learning curve | Medium (rapid API changes historically) | Low-Medium | High |
| Community & docs | ⭐⭐⭐⭐⭐ Largest | ⭐⭐⭐⭐ Large | ⭐⭐⭐ Good |
| Best for | Most LLM applications | Document Q&A, knowledge bases | Enterprise search, pipelines |
When to use LlamaIndex instead: If your primary use case is building a Q&A system over documents or a knowledge base, LlamaIndex's data indexing abstractions and query engines are more purpose-built and often easier to get right than LangChain's more general-purpose approach. The two frameworks are not mutually exclusive — you can use LlamaIndex for retrieval and LangChain for chain/agent orchestration.
Production Deployment Tips
- Use LangSmith for observability: Sign up free at smith.langchain.com. Set `LANGCHAIN_TRACING_V2=true` and `LANGCHAIN_API_KEY` in your environment — every chain invocation is automatically logged with full input/output traces, latency, and token usage. Essential for debugging production issues.
- Cache LLM responses: Use `InMemoryCache` (development) or `RedisCache` (production) to cache identical LLM calls; semantic caching (via `GPTCache`) can cache semantically similar questions. Reduces latency and API costs dramatically for high-traffic Q&A applications (see the sketch after this list).
- Stream responses to users: Always use `chain.astream()` in async web frameworks (FastAPI, Django async) for chat interfaces. Users experience first-token latency instead of waiting for the full response — dramatically better UX.
- Separate indexing from retrieval: Run document ingestion (loading, chunking, embedding, storing) as an offline batch job, not inline with user requests. Indexing is slow and expensive; retrieval should be fast (<200ms). Use a dedicated ingestion pipeline that updates the vector store independently.
- Pin your LangChain version: LangChain has historically made breaking changes between minor versions. Always pin exact versions in your requirements.txt (`langchain==0.2.16`) and test before upgrading in production.
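A minimal sketch of response caching with the in-memory variant (swap in RedisCache with a Redis URL for production):

# Cache identical LLM calls process-wide
from langchain.globals import set_llm_cache
from langchain_community.cache import InMemoryCache

set_llm_cache(InMemoryCache())

chain.invoke({"concept": "neural networks"})  # first call hits the API
chain.invoke({"concept": "neural networks"})  # identical call is served from the cache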
Frequently Asked Questions
EnsembleRetriever; (3) Re-ranking — use Cohere's Rerank API to re-score retrieved chunks before passing to the LLM; (4) Query expansion — use an LLM to generate multiple versions of the user's question and retrieve for each; (5) Metadata filtering — add filters to restrict retrieval to relevant document sections.