How to Build an AI Research Assistant with LangChain & OpenAI 2026
Overview
This tutorial builds a complete AI research assistant that can:
- Search the web for up-to-date information
- Read and analyze PDF documents
- Summarize findings across multiple sources
- Generate structured research reports
Total code: ~150 lines of Python. Time to build: 45 minutes.
Prerequisites
- Python 3.11+
- OpenAI API key (or Anthropic/Gemini)
- Basic Python knowledge
- pip installed
Step 1: Setup Environment
mkdir research-assistant && cd research-assistant
python -m venv .venv && source .venv/bin/activate
pip install langchain langchain-community langchain-openai langgraph \
    langchain-chroma langchain-text-splitters pypdf chromadb tiktoken duckduckgo-search
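Create a .env file. The script reads these values from the process environment, so export them in your shell (for example, set -a; source .env; set +a) or load the file at startup with a package such as python-dotenv:
OPENAI_API_KEY=sk-your-key-here
LANGCHAIN_TRACING_V2=true # optional, for debugging; LangSmith tracing also needs LANGCHAIN_API_KEY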
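If you would rather not add a dependency just to read two variables, a small standard-library loader is enough for a file this simple. The sketch below is an assumption, not part of the original script: load_env_file is a hypothetical helper, and it only handles plain KEY=VALUE lines with optional trailing comments (no quoting, no multi-line values).

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse KEY=VALUE lines from a .env file into os.environ.

    Hypothetical helper: handles blank lines, full-line comments, and
    trailing ' # comment' text, but not quoting or multi-line values.
    """
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop an inline comment such as "true # optional, for debugging".
        value = value.split(" #", 1)[0].strip()
        key = key.strip()
        # Variables already set in the shell take precedence.
        os.environ.setdefault(key, value)
        loaded[key] = os.environ[key]
    return loaded
```

Call load_env_file() before any client is created, so the API key is already in the environment when the LLM wrapper looks for it.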
Step 2: Core Research Agent
# research_assistant.py
import os
from typing import TypedDict, List
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.tools import DuckDuckGoSearchRun
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o", temperature=0.3)
search_tool = DuckDuckGoSearchRun()

# State definition
class ResearchState(TypedDict):
    query: str
    web_results: List[str]  # filled by web_research_node
    pdf_content: List[str]  # filled by pdf_analysis_node
    analysis: str
    report: str
Step 3: Web Research Node
def web_research_node(state: ResearchState) -> dict:
    """Search the web and collect top results."""
    query = state["query"]
    results = []
    # Multiple search variations for depth
    queries = [
        query,
        f"{query} 2026 latest",
        f"{query} analysis review"
    ]
    for q in queries:
        try:
            result = search_tool.run(q)
            results.append(f"Query: {q}\n{result[:2000]}")
        except Exception as e:
            results.append(f"Search failed for '{q}': {str(e)}")
    return {"web_results": results}
Step 4: PDF Analysis Node
def pdf_analysis_node(state: ResearchState) -> dict:
    """Load and process PDF documents."""
    pdf_dir = "./pdfs"
    if not os.path.exists(pdf_dir):
        return {"pdf_content": ["No PDF directory found. Create ./pdfs/ and add PDFs."]}
    # One splitter is enough for every file
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    documents = []
    for pdf_file in sorted(os.listdir(pdf_dir)):
        if not pdf_file.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(pdf_dir, pdf_file)
        pages = PyPDFLoader(pdf_path).load()
        # Split into chunks for embedding
        chunks = splitter.split_documents(pages)
        if not chunks:
            continue  # e.g. a scanned PDF with no extractable text
        # Store in vector DB for retrieval. Every run adds the chunks again,
        # so delete ./chroma_db (or skip files already indexed) to avoid duplicates.
        vectorstore = Chroma.from_documents(
            documents=chunks,
            embedding=OpenAIEmbeddings(),
            persist_directory="./chroma_db"
        )
        # Retrieve relevant chunks from this file only; all PDFs share one collection
        retriever = vectorstore.as_retriever(
            search_kwargs={"k": 5, "filter": {"source": pdf_path}}
        )
        relevant = retriever.invoke(state["query"])
        content = "\n".join(doc.page_content for doc in relevant)
        documents.append(f"From {pdf_file}:\n{content[:3000]}")
    return {"pdf_content": documents}
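Because the node above embeds every PDF each time it runs, repeat runs pay for the same embeddings again. One way to avoid that, sketched here under assumptions, is to keep a small JSON manifest of file hashes and only index files whose hash is new or changed. The helper names (file_digest, pdfs_needing_index, mark_indexed) and the manifest file are hypothetical and not part of the tutorial's code.

```python
import hashlib
import json
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def _read_manifest(manifest_path: str) -> dict:
    """Load the name-to-hash manifest, or an empty dict if it does not exist."""
    manifest_file = Path(manifest_path)
    if not manifest_file.exists():
        return {}
    return json.loads(manifest_file.read_text(encoding="utf-8"))


def pdfs_needing_index(pdf_dir: str, manifest_path: str) -> list:
    """List PDFs whose content hash is missing from, or differs in, the manifest."""
    manifest = _read_manifest(manifest_path)
    pending = []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        if manifest.get(pdf.name) != file_digest(pdf):
            pending.append(pdf)
    return pending


def mark_indexed(pdf: Path, manifest_path: str) -> None:
    """Record a PDF's current hash after its chunks have been stored."""
    manifest = _read_manifest(manifest_path)
    manifest[pdf.name] = file_digest(pdf)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")
```

In the node, you would loop over pdfs_needing_index("./pdfs", "./chroma_db_manifest.json"), embed only those files, and call mark_indexed after each one. Note that a changed PDF still leaves its old chunks in the store, so this sketch would need a delete step for a complete solution.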
Step 5: Analysis Node
def analysis_node(state: ResearchState) -> dict:
    """Synthesize web results and PDF content into analysis."""
    context_parts = []
    if state.get("web_results"):
        context_parts.append("=== WEB RESEARCH ===")
        context_parts.extend(state["web_results"])
    if state.get("pdf_content"):
        context_parts.append("=== PDF ANALYSIS ===")
        context_parts.extend(state["pdf_content"])
    context = "\n\n".join(context_parts)
    analysis_prompt = f"""
    Research Query: {state['query']}
    Research Context:
    {context[:10000]}
    Provide:
    1. Key findings (3-5 bullet points)
    2. Conflicting information or gaps
    3. Confidence assessment (high/medium/low) for each finding
    4. Sources that corroborate each other
    """
    result = llm.invoke(analysis_prompt)
    return {"analysis": result.content}
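One side effect of slicing the joined context at 10,000 characters is that whatever comes last, here the PDF material, is the first thing to be cut when the web results are long. A possible alternative, shown as a sketch rather than the tutorial's method, is to give each section its own share of the budget. build_context is a hypothetical helper, and the equal split is an assumption you may want to tune.

```python
def build_context(sections: dict, max_chars: int = 10000) -> str:
    """Join labelled source lists, giving each non-empty section an equal budget.

    `sections` maps a heading (e.g. "WEB RESEARCH") to a list of text snippets.
    """
    non_empty = {name: items for name, items in sections.items() if items}
    if not non_empty:
        return ""
    per_section = max_chars // len(non_empty)
    parts = []
    for name, items in non_empty.items():
        header = f"=== {name} ==="
        body = "\n\n".join(items)
        # Reserve room for the header and its newline inside the section budget.
        room = max(per_section - len(header) - 1, 0)
        parts.append(f"{header}\n{body[:room]}")
    # The separators between sections can add a few characters, so trim once more.
    return "\n\n".join(parts)[:max_chars]
```

Inside analysis_node this would replace the manual context_parts assembly, for example build_context({"WEB RESEARCH": state["web_results"], "PDF ANALYSIS": state["pdf_content"]}).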
Step 6: Report Generation Node
def report_node(state: ResearchState) -> dict:
    """Generate structured research report."""
    report_prompt = f"""
    Based on this research analysis, generate a structured report:
    Analysis: {state['analysis'][:8000]}
    Report structure:
    ## Executive Summary
    ## Key Findings
    ## Detailed Analysis
    ## Methodology
    ## Limitations
    ## Sources
    """
    result = llm.invoke(report_prompt)
    return {"report": result.content}
Step 7: Build the Graph
# Create the workflow graph
from langgraph.graph import START

workflow = StateGraph(ResearchState)

# Add nodes
workflow.add_node("web_research", web_research_node)
workflow.add_node("pdf_analysis", pdf_analysis_node)
workflow.add_node("analysis", analysis_node)
workflow.add_node("report", report_node)

# Fan out from START so both research nodes actually run.
# With a single entry point, pdf_analysis would never be reached.
workflow.add_edge(START, "web_research")
workflow.add_edge(START, "pdf_analysis")

# Add edges (parallel web + PDF → analysis → report)
workflow.add_edge("web_research", "analysis")
workflow.add_edge("pdf_analysis", "analysis")
workflow.add_edge("analysis", "report")
workflow.add_edge("report", END)

# Compile
app = workflow.compile()
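To make the shape of this graph concrete, here is a framework-free sketch of the same dataflow: two research steps run side by side, their partial updates are merged into the state, and the remaining steps run in order. It illustrates the fan-out/fan-in idea only and is not how LangGraph is implemented; run_pipeline is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(state: dict, parallel_nodes: list, sequential_nodes: list) -> dict:
    """Run `parallel_nodes` concurrently, merge their updates, then run the rest in order.

    Each node is a callable that takes the state dict and returns a dict of
    updated keys, mirroring the node functions in this tutorial.
    """
    state = dict(state)
    with ThreadPoolExecutor(max_workers=len(parallel_nodes) or 1) as pool:
        # Every parallel node sees the same input snapshot.
        futures = [pool.submit(node, dict(state)) for node in parallel_nodes]
        updates = [future.result() for future in futures]
    # Merge in submission order; the nodes here write to different keys.
    for update in updates:
        state.update(update)
    for node in sequential_nodes:
        state.update(node(state))
    return state
```

With the tutorial's functions, the equivalent call would be run_pipeline({"query": "..."}, [web_research_node, pdf_analysis_node], [analysis_node, report_node]). The graph version is still preferable once you want tracing, retries, or conditional edges.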
Step 8: Run It
# Usage example
def run_research(query: str):
    # PDFs are read from ./pdfs, the directory hard-coded in pdf_analysis_node
    initial_state = ResearchState(
        query=query,
        web_results=[],
        pdf_content=[],
        analysis="",
        report=""
    )
    result = app.invoke(initial_state)
    # Save report
    with open("research_report.md", "w", encoding="utf-8") as f:
        f.write(result["report"])
    print("✅ Report saved to research_report.md")
    return result

# Example
if __name__ == "__main__":
    run_research("Latest developments in solid-state battery technology 2026")
Output Example (Abridged)
## Executive Summary
Solid-state battery technology reached commercial viability in H1 2026. Three companies—Toyota, CATL, and QuantumScape—announced production-ready solid-state cells with energy densities exceeding 500 Wh/kg.
## Key Findings
1. **Toyota** began limited production of solid-state batteries for hybrids in April 2026 (source: Reuters)
2. **CATL** demonstrated a 520 Wh/kg solid-state pouch cell at Auto China 2026 (source: CATL press release)
3. **QuantumScape** reached 800 cycles with 95% capacity retention (source: Nature Energy paper)
4. Estimated cost: $75/kWh at scale vs $110/kWh for current Li-ion (source: BNEF)
5. Mass consumer EV adoption expected 2028-2029 (source: multiple analysts)
Tips
- Start with 1-2 sources, then expand. Research quality generally improves with source diversity.
- Add custom tools: integrate Slack, Notion, or Google Drive for organizational research.
- Caching: ChromaDB persists to ./chroma_db between runs, but pdf_analysis_node as written re-embeds every PDF on each run. To benefit from the cache, reopen the existing vector store and skip files that are already indexed.
- Rate limits: web search tools have rate limits. Add delays between queries for batch research.
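For the rate-limit tip, a minimal sketch of what "add delays" could look like: pause between queries and back off exponentially when a call fails. throttled_search and its parameters are hypothetical, and the default delay is a guess rather than a documented limit of any search provider.

```python
import time
from typing import Callable, List


def throttled_search(
    search: Callable[[str], str],
    queries: List[str],
    delay_seconds: float = 2.0,
    max_retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run queries one by one, pausing between calls and backing off on errors."""
    results = []
    for index, query in enumerate(queries):
        if index > 0:
            sleep(delay_seconds)
        for attempt in range(max_retries + 1):
            try:
                results.append(search(query))
                break
            except Exception as exc:
                if attempt == max_retries:
                    results.append(f"Search failed for '{query}': {exc}")
                else:
                    # Exponential backoff: delay, 2*delay, 4*delay, ...
                    sleep(delay_seconds * (2 ** attempt))
    return results
```

In web_research_node, the loop over queries could become throttled_search(search_tool.run, queries). The sleep parameter exists so the function can be tested without real waiting.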
FAQ
Q: Can I use this with Claude or Gemini instead of OpenAI?
A: Yes. Swap ChatOpenAI with ChatAnthropic or ChatGoogleGenerativeAI. Update embeddings accordingly.
Q: How do I deploy this as a web app?
A: Wrap the research function in a FastAPI endpoint and add a React frontend. Or use LangServe for an auto-generated API.
Q: What about cost?
A: Each research run costs approximately ¥7-14 ($1-2) in API calls depending on source volume. PDF analysis is the most expensive step.
Q: Can I add citation tracking?
A: Yes, modify the web_research_node to return URLs alongside content, and include them in the report generation prompt.
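As a rough sketch of that idea, the helper below turns structured search hits into numbered context for the prompt plus a matching source list for the report. The hit shape (dicts with "title", "link", and "snippet" keys) is an assumption; check what your search tool actually returns and adapt the keys. format_with_citations is a hypothetical name.

```python
from typing import Dict, List, Tuple


def format_with_citations(
    hits: List[Dict[str, str]], max_snippet: int = 500
) -> Tuple[str, str]:
    """Turn search hits into numbered context plus a matching source list.

    Assumes each hit is a dict with "title", "link", and "snippet" keys.
    Hits without a link are skipped, and a repeated link reuses its number.
    """
    seen = {}  # url -> (citation number, title)
    context_lines = []
    for hit in hits:
        url = hit.get("link", "").strip()
        if not url:
            continue
        if url not in seen:
            seen[url] = (len(seen) + 1, hit.get("title", "").strip() or url)
        number = seen[url][0]
        snippet = hit.get("snippet", "").strip()[:max_snippet]
        context_lines.append(f"[{number}] {snippet}")
    sources = [f"[{n}] {title} - {url}" for url, (n, title) in seen.items()]
    return "\n".join(context_lines), "\n".join(sources)
```

The first return value goes into the analysis prompt, so the model can refer to findings as [1], [2], and so on. The second can be appended under the report's Sources heading instead of asking the model to reproduce URLs from memory.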