Build an AI Meeting Notes Summarizer Bot: Step-by-Step 2026 Tutorial
Overview
Meetings generate gigabytes of recorded content daily, yet most get archived and forgotten. An AI meeting notes summarizer bot can automatically transcribe recordings, extract action items, identify key decisions, and push formatted summaries to Slack, Notion, or email.
In this tutorial, you’ll build a serverless bot that:
- Accepts meeting recording files (MP4, M4A, or direct Google Meet download URLs)
- Transcribes audio using OpenAI Whisper (local or API)
- Generates structured summaries using Gemini 2.5 Flash
- Extracts action items, decisions, and key questions
- Posts formatted results to Slack or Notion
The entire system runs from a single entry-point script (pipeline.py, backed by two small helper modules) with no external queue infrastructure, which suits teams, freelancers, and small businesses.
Architecture
┌──────────────┐     ┌──────────────┐     ┌─────────────────┐
│  Recording   │────▶│   Whisper    │────▶│   Gemini 2.5    │
│ (audio file) │     │  Transcribe  │     │ Summarize + AI  │
└──────────────┘     └──────────────┘     └────────┬────────┘
                                                   │
                                                   ▼
┌──────────────┐     ┌──────────────┐     ┌─────────────────┐
│ Action Items │     │ Key Decisions│     │    Formatted    │
│ (JSON list)  │     │ (JSON list)  │     │ Markdown Report │
└──────────────┘     └──────────────┘     └────────┬────────┘
                                                   │
                                                   ▼
                                          ┌──────────────────┐
                                          │  Slack / Notion  │
                                          │   Post via API   │
                                          └──────────────────┘
Prerequisites
- Python 3.10+
- Google AI API key (aistudio.google.com)
- Slack webhook URL or Notion API token (optional for posting)
- ffmpeg installed (brew install ffmpeg on macOS)
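Before installing anything, it can help to confirm the prerequisites are actually in place. The sketch below is one way to do that with the standard library only. The function name check_prerequisites is made up for this example, and it checks just two things: the Python version and whether an ffmpeg executable is on the PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means all good.

    Hypothetical helper for this tutorial, not part of Whisper or ffmpeg.
    """
    problems = []
    if sys.version_info[:2] < tuple(min_python):
        wanted = ".".join(str(n) for n in min_python)
        found = sys.version.split()[0]
        problems.append(f"Python {wanted}+ required, found {found}")
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_prerequisites()
    for issue in issues:
        print(f"MISSING: {issue}")
    if not issues:
        print("OK: prerequisites look fine")
```

Running it before Step 1 gives an early, readable error instead of a failure deep inside Whisper's audio loading.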
Step 1: Setup
mkdir meeting-summarizer && cd meeting-summarizer
python -m venv .venv
source .venv/bin/activate
pip install openai-whisper google-genai python-dotenv requests ffmpeg-python
Create .env:
GOOGLE_API_KEY=AIzaSy...your-key
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/... # Optional
NOTION_API_KEY=ntn_... # optional
NOTION_DATABASE_ID=your_db_id # optional
Step 2: Audio Transcription with Whisper
Whisper is OpenAI’s open-source speech-to-text model. We use the medium model — good balance of accuracy and speed. On an M-series Mac, it transcribes at roughly 3x real-time.
Create transcribe.py:
import json
import os

import whisper


def transcribe_audio(audio_path: str, model_size: str = "medium") -> dict:
    """
    Transcribe an audio file using Whisper.

    Args:
        audio_path: Path to audio file (MP4, M4A, WAV, MP3)
        model_size: Whisper model size (tiny, base, small, medium, large)

    Returns:
        dict with 'text', 'segments', 'language', and 'duration'
    """
    if not os.path.exists(audio_path):
        raise FileNotFoundError(f"Audio file not found: {audio_path}")

    print(f"Loading Whisper {model_size} model...")
    model = whisper.load_model(model_size)

    print(f"Transcribing {audio_path}...")
    result = model.transcribe(
        audio_path,
        language="en",  # Set to None for auto-detect
        verbose=False,
        word_timestamps=True,  # Get per-word timestamps
    )

    # Generate statistics
    segments = result["segments"]
    # Whisper's result dict has no 'duration' key, so approximate it with the
    # end time of the last segment and store it for callers.
    total_duration = segments[-1]["end"] if segments else 0.0
    result["duration"] = total_duration
    num_segments = len(segments)
    avg_words_per_seg = sum(len(s.get("text", "").split()) for s in segments) / max(num_segments, 1)

    print("✓ Transcription complete:")
    print(f"  Duration: {total_duration:.1f}s")
    print(f"  Segments: {num_segments}")
    print(f"  Avg words/segment: {avg_words_per_seg:.1f}")
    print(f"  Detected language: {result.get('language', 'N/A')}")
    return result


def segment_to_timestamp(segment: dict) -> str:
    """Convert segment timing to readable MM:SS format."""
    start = segment.get("start", 0)
    end = segment.get("end", 0)
    return f"[{int(start // 60):02d}:{int(start % 60):02d} - {int(end // 60):02d}:{int(end % 60):02d}]"


if __name__ == "__main__":
    import sys

    if len(sys.argv) < 2:
        print("Usage: python transcribe.py <audio_file>")
        sys.exit(1)

    result = transcribe_audio(sys.argv[1])

    # Build a clean structure with one timestamped entry per segment
    output = {
        "full_text": result["text"],
        "language": result["language"],
        "segments": [
            {
                "timestamp": segment_to_timestamp(seg),
                "start": seg["start"],
                "end": seg["end"],
                "text": seg["text"].strip(),
            }
            for seg in result["segments"]
        ],
    }

    # Save full transcript with timestamps
    output_path = "transcript.json"
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(output, f, indent=2, ensure_ascii=False)
    print(f"Full transcript saved to {output_path}")
Note: The list comprehension must call segment_to_timestamp. An earlier draft of this script misspelled it as segment_to_transform, which raises a NameError at runtime.
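The saved transcript.json can be turned back into a line-per-segment transcript with time labels, which is handy for spot-checking or for giving the summarizer time references. This is a minimal sketch, assuming the JSON layout written by transcribe.py (a segments list whose entries have start, end, and text). The names render_timestamped and load_transcript are hypothetical, and the sample segments are made up rather than real Whisper output.

```python
import json


def render_timestamped(transcript: dict) -> str:
    """Render segments as '[MM:SS - MM:SS] text' lines (hypothetical helper)."""
    lines = []
    for seg in transcript.get("segments", []):
        start, end = seg.get("start", 0), seg.get("end", 0)
        stamp = (
            f"[{int(start // 60):02d}:{int(start % 60):02d} - "
            f"{int(end // 60):02d}:{int(end % 60):02d}]"
        )
        lines.append(f"{stamp} {seg.get('text', '').strip()}")
    return "\n".join(lines)


def load_transcript(path: str = "transcript.json") -> dict:
    """Load the JSON file written by transcribe.py."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example with made-up segments (not real Whisper output):
sample = {
    "segments": [
        {"start": 0.0, "end": 4.2, "text": " Welcome everyone."},
        {"start": 64.5, "end": 70.0, "text": " First item: the roadmap."},
    ]
}
print(render_timestamped(sample))
# [00:00 - 00:04] Welcome everyone.
# [01:04 - 01:10] First item: the roadmap.
```

With a real file, render_timestamped(load_transcript()) produces the same layout for every segment.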
Step 3: AI Summarization with Gemini
This is where the magic happens. Gemini takes the raw transcript and produces a structured summary.
Create summarize.py:
import json
import os

from dotenv import load_dotenv
from google import genai
from google.genai import types

load_dotenv()
client = genai.Client(api_key=os.getenv("GOOGLE_API_KEY"))

SUMMARIZE_PROMPT = """
You are an expert meeting summarizer. Analyze the following meeting transcript
and produce a structured summary with:
1. **EXECUTIVE SUMMARY** (2-3 sentences capturing the meeting's core purpose and outcome)
2. **KEY DECISIONS** (bullet list of what was decided, with who made the call)
3. **ACTION ITEMS** (bullet list with owner and deadline mentioned, or "unassigned" / "unknown")
4. **OPEN QUESTIONS** (any questions raised but not resolved)
5. **NEXT STEPS** (what happens after this meeting)
If a section has no content, write "None identified."
Be specific. Use names and numbers from the transcript. Do NOT fabricate any details.
TRANSCRIPT:
{transcript}
Output in JSON format with these keys:
executive_summary, key_decisions (list), action_items (list of dicts with description, owner, deadline),
open_questions (list), next_steps (list)
"""


def summarize_transcript(transcript_text: str) -> dict:
    """Generate structured summary from transcript text using Gemini."""
    response = client.models.generate_content(
        # Stable model ID; the dated 04-17 preview ID is no longer served.
        model="gemini-2.5-flash",
        contents=SUMMARIZE_PROMPT.format(transcript=transcript_text),
        config=types.GenerateContentConfig(
            temperature=0.2,
            max_output_tokens=4096,
            response_mime_type="application/json",
        ),
    )

    # response.text can be None (for example when the response is blocked),
    # so fall back to an empty string before parsing.
    raw = (response.text or "").strip()

    # Parse the JSON response, tolerating a stray Markdown code fence
    try:
        summary = json.loads(raw.removeprefix("```json").removesuffix("```").strip())
    except json.JSONDecodeError:
        # Fallback: return raw text wrapped in a dict
        summary = {
            "executive_summary": raw[:500],
            "key_decisions": [],
            "action_items": [],
            "open_questions": [],
            "next_steps": [],
            "_note": "JSON parsing failed, raw output included",
        }
    return summary


def format_summary_for_slack(summary: dict) -> str:
    """Format the structured summary as Slack markdown."""
    blocks = [":memo: *Meeting Summary Report*\n"]
    blocks.append(f"*Executive Summary*\n{summary.get('executive_summary', 'N/A')}\n")

    decisions = summary.get("key_decisions", [])
    if decisions:
        blocks.append("*Key Decisions*")
        for d in decisions:
            blocks.append(f"• {d}")
        blocks.append("")

    actions = summary.get("action_items", [])
    if actions:
        blocks.append("*Action Items*")
        for a in actions:
            # Items are normally dicts, but tolerate plain strings too
            desc = a.get("description", a) if isinstance(a, dict) else a
            owner = a.get("owner", "") if isinstance(a, dict) else ""
            deadline = a.get("deadline", "") if isinstance(a, dict) else ""
            parts = [f"• {desc}"]
            if owner:
                parts.append(f"  *Owner:* {owner}")
            if deadline:
                parts.append(f"  *Due:* {deadline}")
            blocks.append("\n".join(parts))
        blocks.append("")

    questions = summary.get("open_questions", [])
    if questions:
        blocks.append("*Open Questions*")
        for q in questions:
            blocks.append(f"• {q}")
        blocks.append("")

    return "\n".join(blocks)
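Because the model's JSON is not guaranteed to match the requested schema exactly, it may be worth normalizing the parsed dict before formatting it. The sketch below is one possible approach and is not part of the google-genai SDK; normalize_summary and its defaults are assumptions for illustration. It fills in missing keys, coerces list fields to lists, and turns plain-string action items into dicts using the "unassigned" / "unknown" placeholders from the prompt.

```python
LIST_KEYS = ("key_decisions", "action_items", "open_questions", "next_steps")


def normalize_summary(raw) -> dict:
    """Coerce a parsed model response into the shape the formatter expects."""
    data = raw if isinstance(raw, dict) else {}
    summary = {
        "executive_summary": str(data.get("executive_summary") or "None identified."),
    }

    for key in LIST_KEYS:
        value = data.get(key) or []
        # A bare string such as "None identified." becomes a one-item list.
        summary[key] = value if isinstance(value, list) else [value]

    actions = []
    for item in summary["action_items"]:
        if isinstance(item, dict):
            actions.append({
                "description": str(item.get("description", "")).strip(),
                "owner": item.get("owner") or "unassigned",
                "deadline": item.get("deadline") or "unknown",
            })
        else:
            actions.append({
                "description": str(item).strip(),
                "owner": "unassigned",
                "deadline": "unknown",
            })
    summary["action_items"] = actions
    return summary


# Example with a deliberately messy, made-up response:
messy = {
    "executive_summary": "Roadmap review.",
    "action_items": ["Send notes", {"description": "Book room", "owner": "Ana"}],
    "open_questions": "None identified.",
}
print(normalize_summary(messy))
```

Calling normalize_summary on the result of summarize_transcript before format_summary_for_slack keeps the formatter's isinstance checks from ever seeing an unexpected shape.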
Step 4: The Main Pipeline
Now we tie everything together into a single script that can be called from the command line or triggered as a serverless function.
Create pipeline.py:
import json
import os
import sys
from pathlib import Path

import requests
from dotenv import load_dotenv

from transcribe import transcribe_audio
from summarize import summarize_transcript, format_summary_for_slack

load_dotenv()


def run_pipeline(audio_path: str, post_to_slack: bool = False):
    """
    Full pipeline: transcribe → summarize → output.

    Args:
        audio_path: Path to audio/video file
        post_to_slack: Whether to post summary to Slack
    """
    basename = Path(audio_path).stem

    # Step 1: Transcribe
    print(f"\n{'='*50}")
    print(f"STEP 1/3: Transcribing {basename}")
    print(f"{'='*50}")
    result = transcribe_audio(audio_path)
    transcript = result["text"]

    # Save raw transcript
    transcript_path = f"{basename}_transcript.txt"
    with open(transcript_path, "w", encoding="utf-8") as f:
        f.write(transcript)
    print(f"Transcript saved to {transcript_path}")

    # Step 2: Summarize
    print(f"\n{'='*50}")
    print("STEP 2/3: Summarizing with Gemini")
    print(f"{'='*50}")
    summary = summarize_transcript(transcript)
    summary_path = f"{basename}_summary.json"
    with open(summary_path, "w", encoding="utf-8") as f:
        json.dump(summary, f, indent=2, ensure_ascii=False)
    print(f"Summary saved to {summary_path}")

    # Step 3: Output
    print(f"\n{'='*50}")
    print("STEP 3/3: Output")
    print(f"{'='*50}")
    print("\n📋 EXECUTIVE SUMMARY")
    print(summary.get("executive_summary", "N/A"))
    print("\n📌 KEY DECISIONS")
    for d in summary.get("key_decisions", []):
        print(f"  • {d}")
    print("\n✅ ACTION ITEMS")
    for a in summary.get("action_items", []):
        if isinstance(a, dict):
            print(f"  • {a.get('description', str(a))} (Owner: {a.get('owner', '?')})")
        else:
            print(f"  • {a}")
    print("\n❓ OPEN QUESTIONS")
    for q in summary.get("open_questions", []):
        print(f"  • {q}")

    # Post to Slack if configured
    if post_to_slack:
        slack_url = os.getenv("SLACK_WEBHOOK_URL")
        if slack_url:
            slack_payload = {
                "text": format_summary_for_slack(summary),
                "mrkdwn": True,
            }
            # A timeout keeps the script from hanging if Slack is unreachable
            resp = requests.post(slack_url, json=slack_payload, timeout=10)
            if resp.status_code == 200:
                print("\n✓ Posted to Slack")
            else:
                print(f"\n✗ Slack post failed: {resp.status_code}")
        else:
            print("\n✗ SLACK_WEBHOOK_URL not set")

    print(f"\n{'='*50}")
    print(f"DONE! Files: {transcript_path}, {summary_path}")
    print(f"{'='*50}")
    return summary


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python pipeline.py <audio_file> [--slack]")
        sys.exit(1)
    post_to_slack = "--slack" in sys.argv
    run_pipeline(sys.argv[1], post_to_slack)
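The pipeline script only posts to Slack. For Notion, one option is to create a page in the database named by NOTION_DATABASE_ID through Notion's REST API (POST https://api.notion.com/v1/pages). The sketch below rests on several assumptions: the database's title property is called "Name" (rename it to match yours), the integration behind NOTION_API_KEY has been shared with that database, and the helper names are invented for this example. Notion caps each rich-text item at 2,000 characters, so the body is split into paragraph blocks.

```python
import os

NOTION_PAGES_URL = "https://api.notion.com/v1/pages"
NOTION_VERSION = "2022-06-28"  # a published Notion API version; newer ones exist
MAX_TEXT = 2000  # Notion's per-item rich-text limit


def _paragraph(text: str) -> dict:
    """Wrap plain text in a Notion paragraph block."""
    return {
        "object": "block",
        "type": "paragraph",
        "paragraph": {"rich_text": [{"type": "text", "text": {"content": text}}]},
    }


def build_notion_payload(title: str, body: str, database_id: str,
                         title_property: str = "Name") -> dict:
    """Build the JSON body for creating a page (hypothetical helper)."""
    # Split the body into chunks of at most MAX_TEXT characters.
    chunks = [body[i:i + MAX_TEXT] for i in range(0, len(body), MAX_TEXT)] or [""]
    return {
        "parent": {"database_id": database_id},
        "properties": {
            title_property: {"title": [{"text": {"content": title[:MAX_TEXT]}}]},
        },
        "children": [_paragraph(chunk) for chunk in chunks],
    }


def post_to_notion(title: str, body: str) -> bool:
    """Create the page; returns True on HTTP 200."""
    import requests  # imported here so the payload builder works without it

    token = os.getenv("NOTION_API_KEY")
    database_id = os.getenv("NOTION_DATABASE_ID")
    if not token or not database_id:
        print("NOTION_API_KEY or NOTION_DATABASE_ID not set")
        return False
    resp = requests.post(
        NOTION_PAGES_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Notion-Version": NOTION_VERSION,
            "Content-Type": "application/json",
        },
        json=build_notion_payload(title, body, database_id),
        timeout=30,
    )
    return resp.status_code == 200
```

A call such as post_to_notion(f"Meeting summary: {basename}", text) could sit next to the Slack branch. Note that Slack-style asterisks show up literally in Notion, so a plain-text rendering of the summary is probably the better input.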
Step 5: Testing the Pipeline
# Download a test meeting recording (or use your own)
# For testing, grab a short sample:
pip install yt-dlp
yt-dlp -f "bestaudio" -o "test_meeting.%(ext)s" "https://www.youtube.com/watch?v=example"
# Run the pipeline
python pipeline.py test_meeting.webm
Expected output (numbers and summary text will vary with your recording):
==================================================
STEP 1/3: Transcribing test_meeting
==================================================
Loading Whisper medium model...
Transcribing test_meeting.webm...
✓ Transcription complete:
  Duration: 1832.5s
  Segments: 89
  Avg words/segment: 14.2
  Detected language: en
Transcript saved to test_meeting_transcript.txt
==================================================
STEP 2/3: Summarizing with Gemini
==================================================
Summary saved to test_meeting_summary.json
==================================================
STEP 3/3: Output
==================================================
📋 EXECUTIVE SUMMARY
The team reviewed Q2 product roadmap progress. Three features are on track,
but the mobile app redesign is delayed by two weeks due to API integration issues.
📌 KEY DECISIONS
  • Push mobile redesign launch from June 15 to July 1
  • Allocate one backend engineer to unblock API work
✅ ACTION ITEMS
  • Draft revised project timeline (Owner: Sarah Chen)
  • Schedule API architecture review (Owner: Mike Liu)
❓ OPEN QUESTIONS
==================================================
DONE! Files: test_meeting_transcript.txt, test_meeting_summary.json
==================================================
Tips
- Pre-process audio: Trim leading silence from recordings using ffmpeg -i input.mp4 -af silenceremove=1:0:-30dB output.mp4. For recordings that start with a long silent stretch, this can cut Whisper processing time noticeably (around 30% in some cases).
- Use faster-whisper: Replace openai-whisper with faster-whisper for up to 4x faster transcription on GPU. The API is similar but not identical: you create a WhisperModel, and its transcribe() returns a (segments, info) pair instead of a dict, so transcribe_audio needs a small adapter.
- Add speaker diarization: For multi-speaker meetings, use pyannote.audio (pip install pyannote.audio) to label who said what before summarization.
- Batch process: Wrap the pipeline in a cron job or GitHub Action that watches a Dropbox/Google Drive folder for new recordings.
Common Pitfalls
- ❌ Audio too short: Clips shorter than about 1 second typically return empty transcripts. Validate file duration before processing.
- ❌ Wrong format: Some codecs (like Opus in some containers) cause Whisper errors. Convert to WAV first: ffmpeg -i input.mp4 -acodec pcm_s16le -ar 16000 -ac 1 output.wav.
- ❌ Token limit: Long meetings (>2 hours) produce transcripts that can strain the model. Even when the text fits in Gemini's context window, detail tends to get lost and the 4096-token output cap can truncate the JSON. Chunk the transcript into 30-minute segments and summarize each, then summarize the summaries.
- ❌ Hallucinated action items: Gemini occasionally assigns owners who weren't mentioned. The 0.2 temperature helps, but always have a human verify the summary.
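The chunk-then-summarize idea from the token-limit pitfall can be sketched as follows. This is an illustration under assumptions: it works on Whisper-style segments with start and text fields, groups them into 30-minute windows by start time, and takes the summarizer as a parameter (summarize_fn, for example summarize_transcript from summarize.py) so the chunking logic can be exercised without calling Gemini. The function names are made up.

```python
import json
from typing import Callable, Iterable


def chunk_segments(segments: Iterable[dict], window_s: float = 1800.0) -> list[str]:
    """Group segment texts into consecutive windows of window_s seconds."""
    buckets: dict[int, list[str]] = {}
    for seg in segments:
        index = int(seg["start"] // window_s)  # window this segment starts in
        buckets.setdefault(index, []).append(seg["text"].strip())
    # Emit windows in chronological order; windows with no speech are skipped.
    return [" ".join(buckets[i]) for i in sorted(buckets)]


def summarize_long(segments: Iterable[dict],
                   summarize_fn: Callable[[str], dict],
                   window_s: float = 1800.0) -> dict:
    """Summarize each chunk, then summarize the combined chunk summaries."""
    chunks = chunk_segments(segments, window_s)
    if len(chunks) <= 1:
        # Short meeting: a single pass is enough.
        return summarize_fn(chunks[0] if chunks else "")

    partials = [summarize_fn(chunk) for chunk in chunks]
    # Pass the full partial summaries (not just the prose) to the second
    # round so decisions and action items survive the merge.
    combined = "\n\n".join(
        f"PART {n} SUMMARY:\n{json.dumps(p, ensure_ascii=False)}"
        for n, p in enumerate(partials, 1)
    )
    return summarize_fn(combined)
```

In the pipeline this would replace the direct summarize_transcript(transcript) call with summarize_long(result["segments"], summarize_transcript). The second pass feeds JSON rather than a raw transcript into the same prompt, which is a simplification; a dedicated merge prompt would likely give cleaner results.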
Conclusion
You’ve built a fully functional meeting notes summarizer bot that transcribes audio, extracts structured information with AI, and delivers formatted summaries to Slack. The total cost per hour of meeting is roughly $0.02 in Gemini API fees with local Whisper.
This bot integrates easily into any workflow — run it via a cron job, wrap it in a Flask webhook, or deploy as a Google Cloud Function triggered by new files in Cloud Storage. The same architecture works for podcasts, lectures, interviews, and customer support calls.