Building Enterprise Knowledge Graphs from Unstructured Data 2026
Overview
Most enterprise data is unstructured — PDFs, emails, meeting transcripts, internal wikis — locked in formats that machines can’t reason about. A knowledge graph bridges this gap by extracting entities (people, companies, products, concepts) and their relationships into a queryable graph structure. This tutorial guides you through building an enterprise knowledge graph from scratch: extracting entities from documents using GPT-4o, resolving duplicates with embedding similarity, storing relationships in Neo4j, and querying the graph with natural language. You’ll process real-world sources like company financial filings, product documentation, and internal wikis. The final system supports questions like “Which suppliers have contracts expiring in Q3 and how are they connected to our top customers?”
Prerequisites
- Python 3.10+
- Neo4j 5.x (local or AuraDB cloud — free tier sufficient for 100k nodes)
- OpenAI API key (GPT-4o for entity extraction, text-embedding-3-large for dedup)
- Sample documents: 5-10 PDFs or markdown files (e.g., annual reports, product specs)
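- Python packages (scikit-learn and numpy are used in Step 3, langchain-community in Step 5, streamlit and pyvis in Step 7):
pip install neo4j openai langchain langchain-community langchain-openai spacy pandas numpy scikit-learn streamlit pyvis
- Docker (optional, for local Neo4j; the APOC plugin is enabled because Step 4 calls apoc.coll.union):
docker run -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password -e NEO4J_PLUGINS='["apoc"]' neo4j:5
- Download spaCy model:
python -m spacy download en_core_web_lg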
Step 1: Design the Graph Schema
Before extracting data, define your ontology — what entities and relationships matter.
// Enterprise knowledge graph schema
// Node labels:
// - Person
// - Organization
// - Product
// - Document
// - Concept
// - Project
// - Technology
//
// Relationship types:
// - WORKS_FOR (Person → Organization)
// - MENTIONS (Document → Person | Organization | Product | Concept)
// - DEPENDS_ON (Product → Technology)
// - COMPETES_WITH (Organization → Organization)
// - PARTNERS_WITH (Organization → Organization)
// - LEADS (Person → Project)
// - PRODUCES (Organization → Product)
// - USES (Organization → Product | Technology)
// - CITED_IN (Document → Document)
//
// Example: Create a small test node
CREATE (n:Concept {name: "Knowledge Graph", description: "A structured representation of entities and their relationships"})
RETURN n
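The schema above can also be kept as plain data in Python, so that extracted relationships are checked against it before anything is written to Neo4j. The following is a minimal sketch of that idea. The names `ONTOLOGY`, `is_valid_relationship`, and `ontology_description` are illustrative and do not appear elsewhere in this tutorial.

```python
# Hypothetical helper: the Step 1 schema as data, used to reject
# relationships whose endpoint types do not match the ontology.
ONTOLOGY = {
    "WORKS_FOR": ({"Person"}, {"Organization"}),
    "MENTIONS": ({"Document"}, {"Person", "Organization", "Product", "Concept"}),
    "DEPENDS_ON": ({"Product"}, {"Technology"}),
    "COMPETES_WITH": ({"Organization"}, {"Organization"}),
    "PARTNERS_WITH": ({"Organization"}, {"Organization"}),
    "LEADS": ({"Person"}, {"Project"}),
    "PRODUCES": ({"Organization"}, {"Product"}),
    "USES": ({"Organization"}, {"Product", "Technology"}),
    "CITED_IN": ({"Document"}, {"Document"}),
}


def is_valid_relationship(rel_type, source_label, target_label):
    """Return True if (source_label)-[rel_type]->(target_label) is allowed by the schema."""
    if rel_type not in ONTOLOGY:
        return False
    sources, targets = ONTOLOGY[rel_type]
    return source_label in sources and target_label in targets


def ontology_description():
    """Render the schema as bullet text, one relationship type per line."""
    return "\n".join(
        f"- {rel} ({'|'.join(sorted(src))} → {'|'.join(sorted(dst))})"
        for rel, (src, dst) in ONTOLOGY.items()
    )
```

Keeping the schema in one place means the text handed to the extraction prompt and the validation applied to its output cannot drift apart.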
Step 2: Extract Entities and Relationships from Documents
Use LLM-powered entity extraction to parse unstructured documents:
from openai import OpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from typing import TypedDict, List
import json
client = OpenAI()
llm = ChatOpenAI(model="gpt-4o", temperature=0)
EXTRACTION_PROMPT = """You are an expert knowledge graph entity extractor.
Analyze the following document text and extract:
1. **Entities**: People, Organizations, Products, Technologies, Concepts, Projects
2. **Relationships**: Meaningful connections between entities with types from this ontology:
{ontology}
3. **Confidence**: A score from 0.0 to 1.0 for each extraction
Return a JSON object with this exact structure:
{{
"entities": [
{{"name": "string", "type": "Person|Organization|Product|Technology|Concept|Project",
"aliases": ["string"], "mentions": 5, "context": "string"}}
],
"relationships": [
{{"source": "string", "target": "string", "type": "string",
"evidence": "string from document", "confidence": 0.95}}
]
}}
Document text:
{document_text}
"""
def extract_entities_from_document(text, ontology_description, chunk_size=8000):
    """Extract entities and relationships from a document, one chunk at a time."""
    all_entities = []
    all_relationships = []
    # EXTRACTION_PROMPT already embeds {document_text}, so the user turn only
    # restates the output requirement instead of sending the chunk twice.
    prompt = ChatPromptTemplate.from_messages([
        ("system", EXTRACTION_PROMPT),
        ("user", "Return only the JSON object.")
    ])
    chain = prompt | llm
    # Process in chunks for long documents
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]
    for i, chunk in enumerate(chunks):
        try:
            result = chain.invoke({
                "ontology": ontology_description,
                "document_text": chunk
            })
            # Parse JSON from response
            content = result.content
            # Extract JSON if wrapped in markdown code blocks
            if "```json" in content:
                content = content.split("```json")[1].split("```")[0]
            elif "```" in content:
                content = content.split("```")[1].split("```")[0]
            parsed = json.loads(content)
            all_entities.extend(parsed.get("entities", []))
            all_relationships.extend(parsed.get("relationships", []))
        except Exception as e:
            # A failed chunk is reported and skipped so the rest of the document is still processed
            print(f"Chunk {i} extraction failed: {e}")
    return all_entities, all_relationships
# Test with a sample document
sample_text = """
Acme Corporation announced a strategic partnership with DataFlow Inc. to integrate
their AI-powered analytics platform. Sarah Chen, CTO of Acme, will lead the integration
project called "Project Nexus". The platform uses AWS Bedrock for LLM inference and
PostgreSQL for data storage. Acme's main competitor in this space is Vertex Analytics.
"""
ontology = """
- WORKS_FOR (Person → Organization)
- PARTNERS_WITH (Organization → Organization)
- LEADS (Person → Project)
- USES (Organization → Product|Technology)
- COMPETES_WITH (Organization → Organization)
- PRODUCES (Organization → Product)
"""
entities, relationships = extract_entities_from_document(sample_text, ontology)
print(f"Entities: {json.dumps(entities, indent=2)}")
print(f"Relationships: {json.dumps(relationships, indent=2)}")
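The fixed-size slicing in `extract_entities_from_document` can cut a word or an entity name in half at a chunk boundary. One possible alternative, sketched here under the assumption that a small character overlap is acceptable, ends each chunk at whitespace and repeats the last few characters at the start of the next chunk. The function name `chunk_text` and the default `overlap=200` are illustrative choices, not values from the tutorial.

```python
def chunk_text(text, chunk_size=8000, overlap=200):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters, and a chunk ends at the last
    space inside its window when there is one, so its final word is not split.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer to cut at a space, but only past the overlap so the loop always advances
            cut = text.rfind(" ", start + overlap + 1, end)
            if cut != -1:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks
```

Because of the overlap, an entity mentioned near a boundary may be extracted twice; the deduplication in Step 3 is what absorbs those repeats.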
Step 3: Resolve Duplicate Entities with Embedding Similarity
The same entity may appear as “OpenAI”, “Open AI Inc.”, “OpenAI Corporation”. Deduplicate:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
def get_embedding(text, model="text-embedding-3-large"):
    """Generate embedding for an entity name."""
    response = client.embeddings.create(model=model, input=[text], dimensions=1536)
    return response.data[0].embedding

def resolve_duplicates(entities, threshold=0.85):
    """Merge entities that likely refer to the same thing."""
    if not entities:
        return []
    # Embed each entity's primary name; aliases are carried along when clusters are merged
    entity_embeddings = [get_embedding(e["name"]) for e in entities]
    # Pairwise cosine similarity between all entity names, computed once
    similarity = cosine_similarity(entity_embeddings)
    # Greedy clustering: each unassigned entity seeds a cluster and absorbs
    # every later entity whose similarity to the seed exceeds the threshold
    clusters = []
    assigned = set()
    for i, entity in enumerate(entities):
        if i in assigned:
            continue
        cluster = [entity]
        assigned.add(i)
        for j in range(i + 1, len(entities)):
            if j in assigned:
                continue
            if similarity[i][j] > threshold:
                cluster.append(entities[j])
                assigned.add(j)
        clusters.append(cluster)
    # Merge each cluster into a single entity
    merged_entities = []
    for cluster in clusters:
        types = [e["type"] for e in cluster]
        other_names = {e["name"] for e in cluster[1:]}
        known_aliases = {a for e in cluster for a in e.get("aliases", [])}
        merged = {
            "name": cluster[0]["name"],  # Use first mention's name
            "type": max(set(types), key=types.count),  # Most common type in the cluster
            "aliases": sorted((other_names | known_aliases) - {cluster[0]["name"]}),
            "mentions": sum(e.get("mentions", 1) for e in cluster),
            "context": cluster[0].get("context", "")
        }
        merged_entities.append(merged)
    return merged_entities
# Deduplicate
deduped_entities = resolve_duplicates(entities)
print(f"Before dedup: {len(entities)} entities → After: {len(deduped_entities)}")
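Variants such as “OpenAI”, “Open AI Inc.”, and “OpenAI Corporation” differ only in spacing, punctuation, and legal suffix, so a cheap string normalization pass can group them before any embedding call is made. The sketch below illustrates that pre-pass. The suffix list and the function names are assumptions for illustration, and removing spaces can over-merge short names, so it is probably safest applied within a single entity type.

```python
import re

# Assumed list of legal-form suffixes; extend it for your own corpus.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd", "limited", "company", "co", "gmbh", "plc"}


def normalize_name(name):
    """Lowercase, drop punctuation and trailing legal suffixes, and remove spaces."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    # Strip suffixes from the end, but never strip the only remaining token
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return "".join(tokens)


def group_by_normalized_name(entities):
    """Group entity dicts whose names normalize to the same key."""
    groups = {}
    for entity in entities:
        groups.setdefault(normalize_name(entity["name"]), []).append(entity)
    return groups
```

Groups with more than one member are duplicate candidates that need no embedding comparison; only one representative per group then has to go through `resolve_duplicates`.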
Step 4: Load Entities and Relationships into Neo4j
from neo4j import GraphDatabase
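# Labels and relationship types are interpolated into the Cypher text (they cannot be
# query parameters), so only values from the Step 1 schema are accepted.
ALLOWED_LABELS = {"Person", "Organization", "Product", "Document", "Concept", "Project", "Technology"}
ALLOWED_REL_TYPES = {"WORKS_FOR", "MENTIONS", "DEPENDS_ON", "COMPETES_WITH", "PARTNERS_WITH",
                     "LEADS", "PRODUCES", "USES", "CITED_IN"}

class KnowledgeGraphBuilder:
    def __init__(self, uri, user, password):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))
        self.entity_index = {}  # name or alias → (canonical name, label), filled by build_graph

    def create_constraints(self):
        """Create uniqueness constraints for efficient upserts."""
        with self.driver.session() as session:
            constraints = [
                "CREATE CONSTRAINT IF NOT EXISTS FOR (p:Person) REQUIRE p.name IS UNIQUE",
                "CREATE CONSTRAINT IF NOT EXISTS FOR (o:Organization) REQUIRE o.name IS UNIQUE",
                "CREATE CONSTRAINT IF NOT EXISTS FOR (p:Product) REQUIRE p.name IS UNIQUE",
                "CREATE CONSTRAINT IF NOT EXISTS FOR (c:Concept) REQUIRE c.name IS UNIQUE",
                "CREATE CONSTRAINT IF NOT EXISTS FOR (t:Technology) REQUIRE t.name IS UNIQUE",
                "CREATE CONSTRAINT IF NOT EXISTS FOR (p:Project) REQUIRE p.name IS UNIQUE",
            ]
            for c in constraints:
                session.run(c)

    def merge_entity(self, entity):
        """Upsert an entity node."""
        type_label = entity["type"]
        if type_label not in ALLOWED_LABELS:
            print(f"Skipping entity {entity['name']!r}: unknown type {type_label!r}")
            return
        # apoc.coll.union requires the APOC plugin on the Neo4j server.
        # first_seen is stored as a datetime so it can be compared with datetime() in Step 6.
        query = f"""
        MERGE (n:{type_label} {{name: $name}})
        ON CREATE SET n.aliases = $aliases, n.mentions = $mentions, n.first_seen = datetime()
        ON MATCH SET n.mentions = n.mentions + $mentions, n.aliases = apoc.coll.union(n.aliases, $aliases)
        RETURN n.name
        """
        with self.driver.session() as session:
            session.run(query,
                name=entity["name"],
                aliases=entity.get("aliases", []),
                mentions=entity.get("mentions", 1)
            )

    def merge_relationship(self, rel):
        """Upsert a relationship between two entities."""
        source_type = self._infer_type(rel["source"], rel.get("source_type"))
        target_type = self._infer_type(rel["target"], rel.get("target_type"))
        rel_type = rel["type"]
        if rel_type not in ALLOWED_REL_TYPES or not {source_type, target_type} <= ALLOWED_LABELS:
            print(f"Skipping relationship {rel['source']!r} -[{rel_type}]-> {rel['target']!r}: not in schema")
            return
        query = f"""
        MATCH (s:{source_type} {{name: $source_name}})
        MATCH (t:{target_type} {{name: $target_name}})
        MERGE (s)-[r:{rel_type}]->(t)
        ON CREATE SET r.evidence = $evidence, r.confidence = $confidence, r.first_seen = datetime()
        RETURN s.name, type(r), t.name
        """
        with self.driver.session() as session:
            session.run(query,
                source_name=self._canonical_name(rel["source"]),
                target_name=self._canonical_name(rel["target"]),
                evidence=rel.get("evidence", ""),
                confidence=rel.get("confidence", 0.5)
            )

    def _canonical_name(self, name):
        """Map an alias to the name of the merged entity it belongs to."""
        return self.entity_index.get(name, (name, None))[0]

    def _infer_type(self, name, explicit_type=None):
        """Look up the entity's label; fall back to a heuristic for unknown names."""
        if explicit_type:
            return explicit_type
        if name in self.entity_index:
            return self.entity_index[name][1]
        # Simple heuristic; improve with a classifier for production
        if any(suffix in name for suffix in ["Inc.", "Corp.", "LLC", "Ltd.", "Company"]):
            return "Organization"
        return "Concept"  # Default type

    def build_graph(self, entities, relationships):
        """Build the complete knowledge graph."""
        # Index names and aliases so each relationship endpoint resolves to the
        # label and canonical name of the entity that was actually inserted
        for entity in entities:
            for key in [entity["name"], *entity.get("aliases", [])]:
                self.entity_index.setdefault(key, (entity["name"], entity["type"]))
        print("Creating constraints...")
        self.create_constraints()
        print(f"Inserting {len(entities)} entities...")
        for entity in entities:
            self.merge_entity(entity)
        print(f"Inserting {len(relationships)} relationships...")
        for rel in relationships:
            self.merge_relationship(rel)
        print("Graph build complete!")

# Build the graph
builder = KnowledgeGraphBuilder("bolt://localhost:7687", "neo4j", "password")
builder.build_graph(deduped_entities, relationships)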
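`merge_entity` opens a session and sends one query per entity, which becomes slow once a document set yields thousands of nodes. A common way to reduce round trips is to pass many rows as a single parameter and expand them with `UNWIND`. The sketch below shows one way that might look. The function names are hypothetical, `session` is assumed to be an open session from the official `neo4j` driver, and alias merging on match is left out to keep the example short.

```python
from collections import defaultdict


def rows_by_label(entities):
    """Group entities by label and shape them as UNWIND parameter rows."""
    grouped = defaultdict(list)
    for e in entities:
        grouped[e["type"]].append({
            "name": e["name"],
            "aliases": e.get("aliases", []),
            "mentions": e.get("mentions", 1),
        })
    return dict(grouped)


def batch_merge_query(label):
    """Build one Cypher statement that upserts every row of a given label."""
    if not label.isidentifier():  # labels are interpolated, so reject anything unusual
        raise ValueError(f"Unsafe label: {label!r}")
    return (
        "UNWIND $rows AS row "
        f"MERGE (n:{label} {{name: row.name}}) "
        "ON CREATE SET n.aliases = row.aliases, n.mentions = row.mentions "
        "ON MATCH SET n.mentions = n.mentions + row.mentions"
    )


def batch_merge_entities(session, entities, batch_size=1000):
    """Send entities to Neo4j in batches, one query per label and batch."""
    for label, rows in rows_by_label(entities).items():
        query = batch_merge_query(label)
        for i in range(0, len(rows), batch_size):
            session.run(query, rows=rows[i:i + batch_size])
```

Grouping by label is needed because a label cannot be supplied as a query parameter; each label therefore gets its own statement while the rows themselves stay parameterized.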
Step 5: Query the Knowledge Graph with Natural Language
Build a LangChain-based query system:
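from langchain_community.graphs import Neo4jGraph
# Imported from langchain_community directly; the langchain.chains re-export is not available in every LangChain release
from langchain_community.chains.graph_qa.cypher import GraphCypherQAChain
from langchain_openai import ChatOpenAI
graph = Neo4jGraph(url="bolt://localhost:7687", username="neo4j", password="password")
llm = ChatOpenAI(model="gpt-4o", temperature=0)
kg_chain = GraphCypherQAChain.from_llm(
    llm=llm,
    graph=graph,
    verbose=True,
    validate_cypher=True,
    return_intermediate_steps=True,
    # The chain refuses to start unless this is True, because it executes LLM-generated Cypher.
    # Connect with a read-only database user so generated queries cannot modify the graph.
    allow_dangerous_requests=True,
    top_k=50
)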
# Test queries
queries = [
"What organizations does Acme Corporation partner with?",
"Who leads projects at Acme Corporation?",
"Which technologies does Acme use?",
"Find all people who work at organizations that compete with Acme",
"Show me all documents that mention both AI and PostgreSQL"
]
for q in queries:
    print(f"\nQuery: {q}")
    result = kg_chain.invoke({"query": q})
    print(f"Cypher: {result['intermediate_steps'][0]['query']}")
    print(f"Answer: {result['result']}")
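The chain above executes whatever Cypher the model generates, so it helps to be able to tell read queries from write queries. The following is a rough sketch of a keyword-based check. The clause list and the name `is_read_only` are assumptions for illustration; the check does not detect procedures invoked through `CALL` that write data, so a read-only database user remains the stronger control.

```python
import re

# Clauses that modify data or schema. This keyword list is an assumption of the
# sketch, not a complete Cypher grammar.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV|FOREACH)\b",
    re.IGNORECASE,
)

# Matches single-quoted or double-quoted string literals, including escaped quotes
STRING_LITERAL = re.compile(r"'(?:[^'\\]|\\.)*'|\"(?:[^\"\\]|\\.)*\"")


def is_read_only(cypher):
    """Return True if the query contains no write clause outside string literals."""
    # Blank out quoted strings so that a value like 'Data Set' is not flagged
    without_strings = STRING_LITERAL.sub("''", cypher)
    return WRITE_CLAUSES.search(without_strings) is None
```

With `return_intermediate_steps=True`, the generated query is available in `result['intermediate_steps'][0]['query']`, so the check can be used for auditing; blocking a query before execution would require generating and running the Cypher in separate steps.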
Step 6: Add Temporal and Document Provenance
Knowledge graphs need context about when and where information came from:
// Add document provenance (first_seen matches the property name used by the temporal query)
MATCH (d:Document {title: "Acme Q3 Filing"})
MATCH (o:Organization {name: "Acme Corporation"})
CREATE (d)-[:MENTIONS {
  first_seen: datetime(),
  page: 12,
  confidence: 0.95,
  extract: "Acme Corporation announced a strategic partnership..."
}]->(o);
// Temporal query: find relationships updated in the last month
MATCH ()-[r]->()
WHERE r.first_seen > datetime() - duration('P1M')
RETURN startNode(r).name, type(r), endNode(r).name, r.first_seen
ORDER BY r.first_seen DESC
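Provenance is easiest to capture at extraction time, while the source document is still known. The sketch below shows one way the per-entity rows for `MENTIONS` relationships could be prepared in Python and written in one statement. The function `mention_rows`, the row field names, and the `MENTIONS_QUERY` text are illustrative assumptions rather than part of the pipeline above, and the unlabeled `MATCH` in the query trades speed for brevity.

```python
from datetime import datetime, timezone


def mention_rows(entities, document_title, page=None):
    """Build one MENTIONS row per extracted entity for a single source document."""
    extracted_at = datetime.now(timezone.utc).isoformat()
    return [
        {
            "document": document_title,
            "entity": e["name"],
            "label": e["type"],
            "page": page,
            "extract": e.get("context", "")[:200],  # short evidence snippet
            "extracted_at": extracted_at,
        }
        for e in entities
    ]


# Assumed Cypher for writing the rows; run it with session.run(MENTIONS_QUERY, rows=rows)
MENTIONS_QUERY = """
UNWIND $rows AS row
MERGE (d:Document {title: row.document})
WITH d, row
MATCH (n {name: row.entity}) WHERE row.label IN labels(n)
MERGE (d)-[m:MENTIONS]->(n)
SET m.page = row.page, m.extract = row.extract, m.first_seen = datetime(row.extracted_at)
"""
```

Storing the timestamp as an ISO 8601 string in the row and converting it with `datetime()` inside Cypher keeps the property comparable with `datetime()` in temporal queries.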
Step 7: Visualize the Graph
# Streamlit dashboard with pyvis (save as a script, e.g. app.py, and run: streamlit run app.py)
import os
import tempfile
import streamlit as st
import streamlit.components.v1 as components
from neo4j import GraphDatabase
from pyvis.network import Network

# Color by node label
TYPE_COLORS = {
    "Person": "#4CAF50", "Organization": "#2196F3",
    "Product": "#FF9800", "Technology": "#9C27B0",
    "Concept": "#607D8B", "Document": "#795548"
}

# Node types are stored as labels, not as a `type` property, so read them with labels()
DEFAULT_QUERY = (
    "MATCH (n)-[r]->(m) "
    "RETURN n.name AS n_name, labels(n)[0] AS n_type, type(r) AS rel_type, "
    "m.name AS m_name, labels(m)[0] AS m_type LIMIT 200"
)

def visualize_graph(driver, query=DEFAULT_QUERY):
    """Create an interactive graph visualization."""
    net = Network(height="600px", width="100%", bgcolor="#ffffff", font_color="black")
    net.barnes_hut(gravity=-2000, central_gravity=0.3, spring_length=200)
    with driver.session() as session:
        for record in session.run(query):
            n_name, m_name = record["n_name"], record["m_name"]
            if n_name is None or m_name is None:
                continue  # e.g. Document nodes, which are keyed by title rather than name
            n_type, m_type = record["n_type"], record["m_type"]
            rel_type = record["rel_type"]
            net.add_node(n_name, label=n_name, color=TYPE_COLORS.get(n_type, "#333333"), title=n_type)
            net.add_node(m_name, label=m_name, color=TYPE_COLORS.get(m_type, "#333333"), title=m_type)
            net.add_edge(n_name, m_name, title=rel_type, label=rel_type, arrows="to")
    # Save to a temporary HTML file, read it back, then clean up
    with tempfile.NamedTemporaryFile(delete=False, suffix=".html") as f:
        path = f.name
    net.save_graph(path)
    with open(path, "r") as html_file:
        html_content = html_file.read()
    os.remove(path)
    return html_content

# Display in Streamlit
st.title("🔍 Enterprise Knowledge Graph Explorer")
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
html = visualize_graph(driver)
components.html(html, height=600)
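With 200 relationships on screen, it can be hard to see which entities are hubs. One option is to scale each node by its degree. The helper below is a small sketch of that calculation; `node_sizes` and its size range are illustrative, and the result is meant to be passed to pyvis through the `size` option of `add_node`.

```python
from collections import Counter


def node_sizes(edges, min_size=10, max_size=40):
    """Map each node name to a display size that grows linearly with its degree.

    `edges` is an iterable of (source_name, target_name) pairs.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    if not degree:
        return {}
    top = max(degree.values())
    # Degree 1 maps to min_size and the highest degree maps to max_size
    return {
        name: min_size + (max_size - min_size) * (count - 1) / max(top - 1, 1)
        for name, count in degree.items()
    }
```

To use it, collect the `(n_name, m_name)` pairs while iterating over the query results, compute the sizes once, and pass `size=sizes.get(name, 10)` when adding each node.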
What You’ve Built
You now have an enterprise knowledge graph system:
- LLM-powered entity and relationship extraction from unstructured documents
- Embedding-based duplicate resolution (cosine similarity threshold of 0.85, tunable)
- Neo4j graph storage with constraints and provenance tracking
- Natural language query interface via LangChain
- Interactive graph visualization dashboard
The system transforms hundreds of documents into a queryable, analyzable knowledge structure.
Troubleshooting
Entity extraction misses domain-specific terms: Add custom entity types to the extraction prompt’s ontology. For medical documents, add types like “Symptom”, “Treatment”, “Drug”. For legal documents, add “Statute”, “Precedent”, “Jurisdiction”. Update the Neo4j schema accordingly.
Duplicate resolution threshold too aggressive or lenient:
Tune the threshold parameter. Start at 0.85. If you see too many false merges (two distinct entities becoming one), raise to 0.92. If too many duplicates remain, lower to 0.78. Evaluate on a labeled test set.
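Evaluating on a labeled test set can be as simple as scoring a list of entity-name pairs that have been marked by hand as same or different. The sketch below computes precision, recall, and F1 for a given threshold and picks the best of a few candidates. The function names are hypothetical, and the comparison uses `sim > threshold` to match `resolve_duplicates`.

```python
def evaluate_threshold(scored_pairs, threshold):
    """Compute precision, recall, and F1 for one threshold.

    `scored_pairs` is a list of (similarity, is_same_entity) tuples from a hand-labeled set.
    """
    tp = sum(1 for sim, same in scored_pairs if sim > threshold and same)
    fp = sum(1 for sim, same in scored_pairs if sim > threshold and not same)
    fn = sum(1 for sim, same in scored_pairs if sim <= threshold and same)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_threshold(scored_pairs, candidates=(0.78, 0.85, 0.92)):
    """Return the candidate threshold with the highest F1 on the labeled pairs."""
    return max(candidates, key=lambda t: evaluate_threshold(scored_pairs, t)[2])
```

Low precision corresponds to false merges and low recall to leftover duplicates, so the two numbers indicate in which direction to move the threshold.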
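Neo4j query performance degrades with >500k entities: Add indexes beyond uniqueness constraints. The constraints from Step 4 already index name on the constrained labels, so target the properties they do not cover, and use a full-text index for name search (Neo4j 5 syntax):
CREATE INDEX document_title_idx IF NOT EXISTS FOR (d:Document) ON (d.title);
CREATE INDEX organization_mentions_idx IF NOT EXISTS FOR (o:Organization) ON (o.mentions);
CREATE FULLTEXT INDEX entity_search IF NOT EXISTS FOR (n:Person|Organization|Product) ON EACH [n.name];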
LLM generation of Cypher queries is unreliable for complex graph schemas:
Restrict the schema passed to the LLM: call graph.refresh_schema() to reload it, then prune the schema string so that it includes only the node labels and relationship types relevant to the current user query.
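Deciding which labels and relationship types are relevant to a question can start with a simple keyword map. The sketch below is one crude way to do that. `SCHEMA_KEYWORDS` and `relevant_types` are hypothetical, the keyword lists are guesses that would need tuning for a real corpus, and substring matching will produce some false positives.

```python
# Hypothetical keyword map: which question words suggest which schema elements.
SCHEMA_KEYWORDS = {
    "Person": ["who", "people", "person", "lead"],
    "Organization": ["organization", "company", "supplier", "customer", "partner", "compet"],
    "Product": ["product"],
    "Technology": ["technolog"],
    "Project": ["project"],
    "Document": ["document", "filing", "report"],
    "WORKS_FOR": ["work"],
    "PARTNERS_WITH": ["partner"],
    "COMPETES_WITH": ["compet"],
    "LEADS": ["lead"],
    "USES": ["use", "using"],
    "MENTIONS": ["mention"],
}


def relevant_types(question, keywords=SCHEMA_KEYWORDS, always_include=("Organization",)):
    """Return the schema elements whose keywords occur in the question (substring match)."""
    q = question.lower()
    selected = set(always_include)
    for schema_element, words in keywords.items():
        if any(word in q for word in words):
            selected.add(schema_element)
    return sorted(selected)
```

The resulting list can drive the manual pruning described above. Recent langchain-community releases also appear to accept an `include_types` argument on `GraphCypherQAChain.from_llm` for the same purpose; check the installed version before relying on it.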
Next Steps
- Add batch document processing with a scheduled ETL pipeline (Apache Airflow or n8n)
- Integrate a vector index on entity descriptions for hybrid graph + vector search
- Build a GraphRAG system: combine knowledge graph traversal with LLM reasoning
- Set up incremental updates: re-process documents on modification (webhook-based)
- Export the graph to RDF/OWL format for interoperability with standard ontology tools