Codebase Vectorization 2026 — Semantic Code Search with Embeddings
Overview
Traditional code search tools like grep match text patterns, not meaning: they can’t find “function that validates email” if the codebase calls it checkEmailFormat. Vector embeddings address this by converting code into semantic vectors, enabling search by meaning rather than literal text.
This tutorial walks you through building a codebase vectorization pipeline that chunks your repository into searchable units, generates embeddings with OpenAI’s text-embedding-3-large, stores them in pgvector (PostgreSQL), and provides a semantic search API. You’ll end up with a search system that can find relevant code from natural language queries, docstring descriptions, or even other code snippets. Performance target: sub-100ms search on a 100k-chunk repository.
Prerequisites
- Python 3.10+ and Node.js 18+
- PostgreSQL 15+ with pgvector extension (CREATE EXTENSION vector;)
- OpenAI API key with access to text-embedding-3-large
- A codebase to index (any language; we’ll use a small open-source project like FastAPI as example)
- pip install openai psycopg2-binary numpy tiktoken tqdm fastapi uvicorn
- Docker (optional, for local PostgreSQL): docker run -d -p 5432:5432 -e POSTGRES_PASSWORD=password ankane/pgvector
Step 1: Design the Chunking Strategy
Code cannot be chunked like plain text. Functions, classes, and imports must stay atomic.
import ast

# Method 1: AST-based chunking for Python
def chunk_python_file(source_code, filepath):
    """Parse a Python file and extract one chunk per function or class."""
    tree = ast.parse(source_code)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # get_source_segment returns the exact source text of the node;
            # slicing source_code by line numbers would slice characters instead
            chunks.append({
                "type": "function",
                "name": node.name,
                "file": filepath,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "content": ast.get_source_segment(source_code, node)
            })
        elif isinstance(node, ast.ClassDef):
            # Summarize the class as a header plus method stubs; the method
            # bodies are already indexed as separate function chunks
            methods_text = []
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    methods_text.append(f"    def {item.name}(...): ...")
            chunks.append({
                "type": "class",
                "name": node.name,
                "file": filepath,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "content": f"class {node.name}:\n" + "\n".join(methods_text)
            })
    return chunks
# Method 2: token-count-aware chunking for any language
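import tiktoken

# cl100k_base is the encoding used by OpenAI's text-embedding-3 models,
# so these counts match what the embedding API will see
tokenizer = tiktoken.get_encoding("cl100k_base")

def get_tokens(text):
    # disallowed_special=() treats strings like "<|endoftext|>" in source files as plain text
    return tokenizer.encode(text, disallowed_special=())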
def chunk_by_tokens(code, filepath, max_tokens=500, overlap_tokens=50):
    """Split code into overlapping chunks by token count."""
    lines = code.split('\n')
    chunks = []
    current_chunk = []
    current_tokens = 0
    # Approximate the token overlap in lines, assuming ~10 tokens per line
    # (50 tokens -> 5 lines). Keep at least 1: a slice of [-0:] copies the whole chunk.
    overlap_lines = max(1, overlap_tokens // 10)
    for i, line in enumerate(lines):
        line_tokens = len(get_tokens(line))
        if current_tokens + line_tokens > max_tokens and current_chunk:
            chunk_text = '\n'.join(current_chunk)
            chunks.append({
                "file": filepath,
                "start_line": i - len(current_chunk) + 1,
                "end_line": i,
                "content": chunk_text,
                "token_count": current_tokens
            })
            # Carry the last few lines over so context spans chunk boundaries
            overlap_content = current_chunk[-overlap_lines:]
            current_chunk = overlap_content + [line]
            current_tokens = sum(len(get_tokens(l)) for l in current_chunk)
        else:
            current_chunk.append(line)
            current_tokens += line_tokens
    # Last chunk
    if current_chunk:
        chunk_text = '\n'.join(current_chunk)
        chunks.append({
            "file": filepath,
            "start_line": len(lines) - len(current_chunk) + 1,
            "end_line": len(lines),
            "content": chunk_text,
            "token_count": current_tokens
        })
    return chunks
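To see what the AST-based approach extracts, here is a minimal, self-contained sketch that runs on an in-memory source string. The sample source and the helper name extract_function_chunks are illustrative assumptions, not part of the pipeline; the sketch only shows how ast.get_source_segment recovers each function's text and line range.

```python
import ast

# Hypothetical sample file, used only for this demonstration
SAMPLE_SOURCE = '''import re

def check_email_format(address):
    """Return True if the address looks like an email."""
    return re.match(r"[^@]+@[^@]+", address) is not None

class UserService:
    def register(self, address):
        return check_email_format(address)
'''

def extract_function_chunks(source_code):
    """Return (name, start_line, end_line, text) for every function in the source."""
    tree = ast.parse(source_code)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            text = ast.get_source_segment(source_code, node)
            found.append((node.name, node.lineno, node.end_lineno, text))
    # ast.walk does not promise source order, so sort by starting line
    return sorted(found, key=lambda item: item[1])

function_chunks = extract_function_chunks(SAMPLE_SOURCE)
for name, start, end, _ in function_chunks:
    print(f"{name}: lines {start}-{end}")
```

Methods nested inside a class are returned as their own chunks, which is why the class chunk in Method 1 only needs method stubs.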
Step 2: Scan and Chunk Your Codebase
Build a recursive scanner that walks your repository:
from pathlib import Path

def scan_codebase(root_dir, extensions={'.py', '.js', '.ts', '.jsx', '.tsx', '.go', '.rs', '.java'}):
    """Walk the codebase and chunk every file."""
    exclude_dirs = {'node_modules', '__pycache__', '.git', 'venv', '.venv', 'dist', 'build', '.next', 'target'}
    all_chunks = []
    for path in Path(root_dir).rglob('*'):
        if path.suffix not in extensions:
            continue
        if any(excl in path.parts for excl in exclude_dirs):
            continue
        if path.stat().st_size > 100000:  # Skip files >100KB
            continue
        if path.stat().st_size < 10:  # Skip empty/near-empty files
            continue
        try:
            source = path.read_text(encoding='utf-8', errors='ignore')
            # Use method 1 for Python, method 2 for everything else
            if path.suffix == '.py':
                chunks = chunk_python_file(source, str(path))
            else:
                chunks = chunk_by_tokens(source, str(path), max_tokens=500)
            all_chunks.extend(chunks)
            print(f" ✓ {path} → {len(chunks)} chunks")
        except Exception as e:
            print(f" ✗ {path}: {e}")
    return all_chunks

# Example: scan FastAPI source
chunks = scan_codebase("./fastapi/fastapi")
print(f"Total chunks: {len(chunks)}")
# Expected: ~1500-3000 chunks for a medium codebase
Step 3: Generate Embeddings
Chunk → vector transformation is the core of semantic search:
from openai import OpenAI
from tqdm import tqdm

# Reads the key from the OPENAI_API_KEY environment variable instead of hardcoding it
client = OpenAI()

def embed_chunks(chunks, batch_size=20):
    """Generate embeddings for code chunks in batches."""
    EMBEDDING_MODEL = "text-embedding-3-large"
    DIMENSIONS = 1536  # Use 1536 for speed; 3072 for max accuracy (must match the vector column size)
    for i in tqdm(range(0, len(chunks), batch_size)):
        batch = chunks[i:i+batch_size]
        texts = [c["content"] for c in batch]
        response = client.embeddings.create(
            model=EMBEDDING_MODEL,
            input=texts,
            dimensions=DIMENSIONS
        )
        # Results come back in input order, so pair them with the batch
        for chunk, embedding_data in zip(batch, response.data):
            chunk["embedding"] = embedding_data.embedding
    return chunks

# Embed every chunk; each chunk dict gains an "embedding" key in place
chunks_with_embeddings = embed_chunks(chunks)
print(f"Embedding dimensionality: {len(chunks_with_embeddings[0]['embedding'])}")
Step 4: Store in PostgreSQL with pgvector
import psycopg2
from psycopg2.extras import execute_values

def setup_pgvector():
    # Assumes the code_search database already exists (e.g. createdb code_search)
    conn = psycopg2.connect(
        host="localhost", port=5432, dbname="code_search",
        user="postgres", password="password"
    )
    cur = conn.cursor()
    # Enable pgvector extension
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    # Create the table; vector(1536) must match DIMENSIONS from Step 3
    cur.execute("""
        CREATE TABLE IF NOT EXISTS code_chunks (
            id SERIAL PRIMARY KEY,
            file_path TEXT NOT NULL,
            start_line INTEGER,
            end_line INTEGER,
            chunk_type TEXT,
            chunk_name TEXT,
            content TEXT NOT NULL,
            token_count INTEGER,
            embedding vector(1536)
        );
    """)
    conn.commit()
    return conn

def insert_chunks(conn, chunks):
    cur = conn.cursor()
    data = [
        (
            c["file"], c.get("start_line"), c.get("end_line"),
            c.get("type"), c.get("name"), c["content"],
            c.get("token_count"), c["embedding"]
        )
        for c in chunks
    ]
    execute_values(cur, """
        INSERT INTO code_chunks
        (file_path, start_line, end_line, chunk_type, chunk_name, content, token_count, embedding)
        VALUES %s
    """, data)
    conn.commit()
    print(f"Inserted {len(data)} chunks")

def create_vector_index(conn):
    """Create the IVFFlat index for fast approximate search.

    Called after the data is loaded, because IVFFlat derives its clusters
    from the rows that exist when the index is built.
    """
    cur = conn.cursor()
    cur.execute("""
        CREATE INDEX IF NOT EXISTS idx_code_chunks_embedding
        ON code_chunks
        USING ivfflat (embedding vector_cosine_ops)
        WITH (lists = 100);
    """)
    conn.commit()

conn = setup_pgvector()
insert_chunks(conn, chunks_with_embeddings)
create_vector_index(conn)
Step 5: Build the Semantic Search Engine
from openai import OpenAI

# Reads the key from the OPENAI_API_KEY environment variable
client = OpenAI()

def search_code(query, conn, top_k=10):
    """Search codebase by semantic similarity to a query."""
    # Generate query embedding (same model and dimensions as the indexed chunks)
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[query],
        dimensions=1536
    )
    query_embedding = response.data[0].embedding
    # Search with pgvector; <=> is cosine distance, so 1 - distance is cosine similarity
    cur = conn.cursor()
    cur.execute("""
        SELECT
            file_path, start_line, end_line, chunk_type, chunk_name,
            LEFT(content, 200) AS content_preview,
            1 - (embedding <=> %s::vector) AS similarity
        FROM code_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_embedding, query_embedding, top_k))
    results = []
    for row in cur.fetchall():
        results.append({
            "file": row[0],
            "lines": f"{row[1]}-{row[2]}",
            "type": row[3],
            "name": row[4],
            "preview": row[5],
            "similarity": round(row[6], 4)
        })
    return results

# Test: search for email validation
results = search_code("validate email address format", conn)
for r in results:
    print(f"{r['similarity']:.4f} {r['file']}:{r['lines']} ({r['type']}: {r['name']})")
# Expected: Finds functions like "validate_email" or "check_email_format"
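pgvector's <=> operator returns cosine distance, so the 1 - (embedding <=> ...) expression in the query is cosine similarity. The sketch below reproduces that ranking in plain Python on toy 3-dimensional vectors. The vectors and chunk names are made up for illustration; real embeddings have 1536 dimensions and come from the embedding API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, chunk_vecs, top_k=3):
    """Score every chunk against the query and return the best matches first."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical toy embeddings, not real model output
toy_chunks = {
    "check_email_format": [0.9, 0.1, 0.0],
    "parse_config": [0.0, 0.2, 0.9],
    "validate_phone": [0.6, 0.6, 0.1],
}
toy_query = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_chunks)
for name, score in ranking:
    print(f"{score:.4f} {name}")
```

This brute-force scan compares the query against every chunk; the IVFFlat index exists to avoid exactly that cost at scale, at the price of approximate results.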
Step 6: Add Hybrid Search (Vector + Keyword)
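Pure semantic search can miss exact matches that have weak embedding similarity, such as a rare identifier typed verbatim. Add keyword search as a complement. The example below uses PostgreSQL full-text search with ts_rank, which is a term-frequency ranking rather than true BM25: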
# Option A: Use PostgreSQL full-text search combined with pgvector
def hybrid_search(query, conn, top_k=10):
    """Combined vector + keyword search."""
    response = client.embeddings.create(
        model="text-embedding-3-large",
        input=[query],
        dimensions=1536
    )
    query_embedding = response.data[0].embedding
    cur = conn.cursor()
    # Caveat: cosine similarity and ts_rank are on different scales, so the final
    # ORDER BY score is only a rough merge; match_type tells you where a row came from
    cur.execute("""
        WITH semantic AS (
            SELECT id, file_path, start_line, content,
                   1 - (embedding <=> %s::vector) AS score,
                   'semantic' AS match_type
            FROM code_chunks
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        ),
        keyword AS (
            SELECT id, file_path, start_line, content,
                   ts_rank(to_tsvector('english', content), plainto_tsquery('english', %s)) AS score,
                   'keyword' AS match_type
            FROM code_chunks
            WHERE to_tsvector('english', content) @@ plainto_tsquery('english', %s)
            ORDER BY score DESC
            LIMIT %s
        )
        SELECT * FROM semantic
        UNION ALL
        SELECT * FROM keyword
        WHERE id NOT IN (SELECT id FROM semantic)
        ORDER BY score DESC
        LIMIT %s;
    """, (query_embedding, query_embedding, top_k, query, query, top_k, top_k))
    return cur.fetchall()

# Test hybrid search
results = hybrid_search("HTTP exception handler", conn)
print(f"Found {len(results)} results combining semantic and keyword matching")
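Because cosine similarity and ts_rank are not directly comparable, one common alternative to sorting on a shared score column is reciprocal rank fusion (RRF), which merges result lists using only each item's position. This is a swapped-in technique, not what the SQL above does. The function name, the constant k=60 (a conventional default), and the sample ids are assumptions for illustration.

```python
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=10):
    """Merge several ranked lists of ids into one ranking.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the best result in a list.
    """
    scores = {}
    for ranked in ranked_lists:
        for rank, item_id in enumerate(ranked, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return fused[:top_k]

# Hypothetical chunk ids, best match first in each list
semantic_ids = [12, 7, 33, 4]
keyword_ids = [7, 90, 12]

fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
for chunk_id, score in fused:
    print(chunk_id, round(score, 5))
```

To use it with the pipeline, you would run the semantic and keyword queries separately, pass their id lists to the function, and fetch the rows for the fused ids. Chunks found by both searches rise to the top without any score normalization.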
Step 7: Build a FastAPI Search Service
Wrap everything in an API:
from fastapi import FastAPI, Query
from pydantic import BaseModel
import psycopg2

# Assumes scan_codebase, embed_chunks, insert_chunks, and search_code from the
# previous steps are defined in (or imported into) this module
app = FastAPI(title="Codebase Vector Search")
conn = psycopg2.connect(host="localhost", dbname="code_search",
                        user="postgres", password="password")

class SearchResult(BaseModel):
    file: str
    lines: str
    type: str | None = None
    name: str | None = None
    preview: str
    similarity: float

# The handlers use plain def rather than async def: they make blocking database
# and API calls, and FastAPI runs synchronous handlers in a worker thread pool
@app.get("/search", response_model=list[SearchResult])
def search(q: str = Query(..., description="Natural language query"),
           top_k: int = Query(10, ge=1, le=50)):
    results = search_code(q, conn, top_k=top_k)
    return results

@app.post("/index")
def index_repo(repo_path: str):
    """Re-index the codebase."""
    chunks = scan_codebase(repo_path)
    chunks_with_emb = embed_chunks(chunks)
    # Clear and re-insert
    cur = conn.cursor()
    cur.execute("TRUNCATE code_chunks")
    conn.commit()
    insert_chunks(conn, chunks_with_emb)
    return {"message": f"Indexed {len(chunks_with_emb)} chunks"}

@app.get("/status")
def status():
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM code_chunks")
    count = cur.fetchone()[0]
    return {"total_chunks": count}

# Run with: uvicorn search_api:app --reload --port 8000
What You’ve Built
You have a complete codebase vectorization pipeline:
- Recursive code scanner with language-aware chunking (AST for Python, token-based for others)
- OpenAI embedding generation at batch scale (1536-dimensional vectors)
- pgvector storage with IVFFlat indexing for sub-100ms search
- Hybrid semantic + keyword search engine
- REST API serving search results from any natural language query
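Searching a 10k-file codebase is targeted at under 200ms from query to results. That end-to-end figure includes the embedding API call for the query, so it depends on network latency as well as on the pgvector lookup.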
Troubleshooting
IVFFlat index doesn’t improve search speed:
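The lists parameter in WITH (lists = 100) should scale with the number of rows. The pgvector documentation suggests rows / 1000 as a starting point for up to 1M rows and sqrt(rows) beyond that: for 10,000 chunks, start with lists = 10; for 100,000 chunks, use lists = 100. Build the index after the table is populated, because IVFFlat derives its clusters from the rows that exist at build time. REINDEX keeps the old lists value, so to change it, drop and recreate the index: DROP INDEX idx_code_chunks_embedding; followed by the CREATE INDEX statement from Step 4. At query time, SET ivfflat.probes = 10; trades some speed for better recall (the default is 1).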
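As a rough aid for picking index settings, the sketch below encodes the starting points suggested in the pgvector documentation: rows / 1000 lists for up to 1M rows, sqrt(rows) above that, and about sqrt(lists) probes. The function name suggest_ivfflat_params is hypothetical, and the values are starting points to benchmark, not guarantees.

```python
import math

def suggest_ivfflat_params(row_count):
    """Return starting values for IVFFlat lists and probes given a row count."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = int(math.sqrt(row_count))
    # More probes improves recall, fewer probes improves speed
    probes = max(1, round(math.sqrt(lists)))
    return {"lists": lists, "probes": probes}

for rows in (10_000, 100_000, 4_000_000):
    params = suggest_ivfflat_params(rows)
    print(f"{rows} rows: lists={params['lists']}, probes={params['probes']}")
```

The lists value goes into the CREATE INDEX statement, and the probes value is set per session with SET ivfflat.probes before running the search query.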
OpenAI embedding API returns 429 rate limit errors:
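Lower the request rate first: reduce batch_size to 10 and add time.sleep(0.5) between batches. For retries, the OpenAI Python client already retries rate-limited requests with exponential backoff; raise the retry budget by constructing the client with OpenAI(max_retries=5) (the default is 2).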
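If you want explicit control over retries, exponential backoff can be sketched as a small wrapper. The helper name call_with_backoff and its parameters are assumptions for illustration. The demo uses a fake flaky function and records delays instead of sleeping; in the pipeline you would wrap the client.embeddings.create call and pass retry_on=(openai.RateLimitError,).

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=0.5, max_delay=30.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call fn(), retrying on the given exceptions with doubling delays plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_retries:
                raise  # retry budget exhausted, surface the error
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Up to 10% jitter so parallel workers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))

# Demo: a fake API call that fails twice, then succeeds
attempts = {"count": 0}

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RuntimeError("simulated 429")
    return "ok"

recorded_delays = []
result = call_with_backoff(flaky_call, retry_on=(RuntimeError,), sleep=recorded_delays.append)
print(result, len(recorded_delays))
```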
Large files (>100KB) cause token limit errors:
The scanner already skips files >100KB. For legitimate large source files, split into top-level declarations only rather than chunking every function body.
Semantic search returns irrelevant results for very short queries:
Short queries like “auth” lack semantic context. Enrich them programmatically by adding related terms before embedding: “auth” → “authentication, authorization, login, password handling code”.
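A minimal sketch of that enrichment step, assuming a hand-maintained lookup table. The table contents, the helper name expand_query, and the three-word cutoff are hypothetical choices; the table would need to be tailored to your codebase's vocabulary.

```python
# Hypothetical expansion table: short term -> related phrases
QUERY_EXPANSIONS = {
    "auth": "authentication, authorization, login, password handling code",
    "db": "database connection, query execution, transaction handling code",
}

def expand_query(query, min_words=3):
    """Append related terms to very short queries; leave longer queries unchanged."""
    words = query.lower().split()
    if len(words) >= min_words:
        return query
    extras = [QUERY_EXPANSIONS[w] for w in words if w in QUERY_EXPANSIONS]
    if not extras:
        return query
    return f"{query}: {'; '.join(extras)}"

for sample in ("auth", "validate email address format", "cache"):
    print(repr(expand_query(sample)))
```

The expanded string is what you would pass to search_code in place of the raw query.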
Next Steps
- Add deduplication with cosine similarity threshold (>0.95 = duplicate)
- Implement streaming re-indexing with a Git hook (post-commit triggers re-index of changed files)
- Integrate with VS Code as a custom search provider extension
- Use a local embedding model like mxbai-embed-large for offline search
- Build a RAG system on top: “Explain this bug using context from our codebase”