Build an Automated Blog Content Pipeline with RSS + LLM in 2026

Introduction

Blogging at scale means spending hours reading, summarizing, and writing. What if you could automate the pipeline — from finding good content to publishing a draft — without sacrificing quality?

This tutorial builds an end-to-end automated blog content pipeline that:

Fetches articles from RSS feeds (industry news, competitor blogs, arXiv)
Extracts and cleans article content
Runs each article through an LLM prompt chain for analysis and rewriting
Outputs a polished, SEO-optimized blog post as Markdown

By the end, you’ll have a cron-ready script that produces publishable drafts daily.

Prerequisites

pip install feedparser requests beautifulsoup4 openai langchain-core langchain-openai trafilatura tiktoken python-frontmatter arxiv

Set your OpenAI or compatible API key:

export OPENAI_API_KEY="sk-..."  # or use any OpenAI-compatible provider

Step 1: RSS Feed Fetcher

Create fetcher.py that pulls articles from multiple feeds:

import feedparser
import hashlib
from dataclasses import dataclass

@dataclass
class FeedItem:
    id: str
    title: str
    url: str
    summary: str
    published: str
    source: str

def fetch_feeds(feed_urls: list[str], max_per_feed: int = 5) -> list[FeedItem]:
    """Fetch up to max_per_feed entries from each feed, deduplicated by link."""
    items = []
    seen = set()

    for url in feed_urls:
        try:
            feed = feedparser.parse(url)
            # feedparser reports most fetch and parse problems via `bozo` rather than raising
            if feed.bozo and not feed.entries:
                print(f"⚠️  Error fetching {url}: {feed.get('bozo_exception', 'unknown error')}")
                continue
            source_name = feed.feed.get('title', url)

            for entry in feed.entries[:max_per_feed]:
                link = entry.get('link')
                if not link:
                    continue  # Without a link there is nothing to extract later
                entry_id = hashlib.md5(link.encode()).hexdigest()
                if entry_id in seen:
                    continue
                seen.add(entry_id)

                items.append(FeedItem(
                    id=entry_id,
                    title=entry.get('title', ''),
                    url=link,
                    summary=entry.get('summary', ''),
                    published=entry.get('published', ''),
                    source=source_name,
                ))
        except Exception as e:
            print(f"⚠️  Error fetching {url}: {e}")

    return items

# Example feed sources
FEEDS = [
    "https://news.ycombinator.com/rss",
    "https://arxiv.org/rss/cs.AI",
    "https://rss.nytimes.com/services/xml/rss/nyt/Technology.xml",
    "https://feeds.feedburner.com/TechCrunch",
]

if __name__ == "__main__":
    items = fetch_feeds(FEEDS)
    print(f"Fetched {len(items)} articles")
    for item in items[:5]:
        print(f"  • [{item.source}] {item.title}")
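The fetcher deduplicates on a hash of the raw link, so the same article can slip through twice when feeds append tracking parameters or a trailing slash. A minimal sketch of normalizing the URL before hashing follows. The helper names normalize_url and article_id, and the list of tracking parameters, are assumptions for illustration and are not part of fetcher.py above.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical tracking parameters to drop; adjust for the feeds you follow.
TRACKING_PREFIXES = ("utm_",)
TRACKING_KEYS = {"ref", "fbclid", "gclid"}


def normalize_url(url: str) -> str:
    """Return a canonical form of `url` suitable for deduplication."""
    parts = urlsplit(url.strip())
    # Keep only query parameters that are not known tracking noise
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_KEYS
    ]
    # Treat "/post" and "/post/" as the same page; keep "/" for the site root
    path = parts.path.rstrip("/") or "/"
    # The empty last element drops the #fragment
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, urlencode(query), "")
    )


def article_id(url: str) -> str:
    """Hash the normalized URL, mirroring the md5-based ID used by the fetcher."""
    return hashlib.md5(normalize_url(url).encode()).hexdigest()


if __name__ == "__main__":
    print(normalize_url("https://Example.com/post/?utm_source=rss&id=7#comments"))
```

Swapping this in would mean calling article_id(entry.link) where the fetcher currently hashes the link directly. Whether dropping a parameter such as ref is safe depends on the site, so treat the list as a starting point.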

Step 2: Content Extractor

Raw RSS summaries are often short. We need the full article content. Use trafilatura for clean extraction:

# extractor.py
import trafilatura

def extract_article(url: str) -> str | None:
    """Extract clean article text from a URL."""
    try:
        downloaded = trafilatura.fetch_url(url)
        if not downloaded:
            return None
        text = trafilatura.extract(
            downloaded,
            include_comments=False,
            include_tables=False,
            include_formatting=True,
            output_format='markdown',
        )
        return text
    except Exception as e:
        print(f"⚠️  Failed to extract {url}: {e}")
        return None

Note: Install trafilatura alongside the other packages: pip install trafilatura

Step 3: LLM Content Pipeline (The Core)

This is the core of the pipeline. It chains two prompts, one for analysis and one for rewriting, using LangChain’s ChatOpenAI (provided by the langchain-openai package):

# pipeline.py
import json
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(
    model="gpt-4o-mini",  # Cost-effective for this workload
    temperature=0.3,
)

ANALYSIS_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a senior content strategist. Analyze the following article and produce a JSON output with:
- topic: The main topic (2-5 words)
- angle: A fresh, unique angle that hasn't been overdone
- target_keyword: Primary SEO keyword
- key_points: 3-5 bullet points from the article
- gap: What's missing from this article that readers would want
- hook: An engaging opening sentence for a new post"""),
    ("human", "Title: {title}\n\nContent:\n{content}"),
])

REWRITE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are a professional blog writer. Write a compelling blog post following this structure:

1. **Hook** — Start with the provided hook
2. **Context** — Briefly explain why this matters now (2-3 sentences)
3. **Main Content** — Expand the key points with your own analysis and examples. Add context the original article missed (the "gap").
4. **Actionable Takeaways** — What should the reader do with this information?
5. **Related Resources** — Mention 1-2 related tools or further reading

Requirements:
- SEO-optimized: Include the target keyword naturally in the first 100 words and once per section
- 500-800 words
- Write for a technical but non-expert audience (CTO, product manager, senior developer)
- Professional but not boring — use concrete examples
- Avoid fluff: no "In today's fast-paced..." or "Revolutionary" or "Game-changing"
- Output as clean Markdown with appropriate headings"""),
    ("human", """Analysis: {analysis}
    
Original Title: {title}
Original Source: {source}"""),
])

def process_article(title: str, content: str, source: str) -> dict | None:
    """Run a single article through the LLM pipeline and return a blog post draft."""
    if not content or len(content) < 200:
        return None
    
    # Step 1: Analyze
    try:
        response = llm.invoke(ANALYSIS_PROMPT.format_messages(title=title, content=content[:4000]))
        analysis = json.loads(response.content.strip().removeprefix("```json").removesuffix("```").strip())
    except Exception as e:  # Also covers json.JSONDecodeError from malformed model output
        print(f"⚠️  Analysis failed for '{title}': {e}")
        return None
    
    # Step 2: Rewrite
    try:
        blog_post = llm.invoke(REWRITE_PROMPT.format_messages(
            analysis=json.dumps(analysis, indent=2),
            title=title,
            source=source,
        )).content
    except Exception as e:
        print(f"⚠️  Rewrite failed for '{title}': {e}")
        return None
    
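    return {
        "title": f"{analysis.get('topic', title)}: {analysis.get('hook', '')}"[:100],
        "slug": title.lower().replace(" ", "-")[:80],
        "meta_description": analysis.get('hook', ''),
        "target_keyword": analysis.get('target_keyword', ''),
        # The frontmatter template reads post['tags']; seed it with the target keyword
        "tags": [analysis['target_keyword']] if analysis.get('target_keyword') else [],
        "body": blog_post,
        "source_url": None,  # Set by the caller, which knows the article URL
        "source_name": source,
    }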
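The analysis step strips a leading and trailing code fence before calling json.loads, which breaks as soon as the model adds a sentence before or after the JSON. A more tolerant parser might look like the sketch below. parse_llm_json is a hypothetical helper, not a LangChain API, and the required-key check assumes the six fields listed in ANALYSIS_PROMPT.

```python
import json
import re

# Keys requested by ANALYSIS_PROMPT
REQUIRED_KEYS = {"topic", "angle", "target_keyword", "key_points", "gap", "hook"}

# Three backticks: the Markdown fence that models often wrap JSON in
FENCE = "`" * 3
FENCED_BLOCK = re.compile(
    FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE
)


def parse_llm_json(raw: str) -> dict:
    """Parse a JSON object from a model reply, tolerating fences and stray prose."""
    fenced = FENCED_BLOCK.search(raw)
    candidate = fenced.group(1) if fenced else raw
    # Fall back to the outermost braces so surrounding sentences are ignored
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    data = json.loads(candidate[start : end + 1])  # JSONDecodeError is a ValueError
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"analysis is missing keys: {sorted(missing)}")
    return data
```

Because json.JSONDecodeError subclasses ValueError, a caller can catch ValueError alone and skip the article. Failing fast on missing keys keeps a half-filled analysis from reaching the rewrite prompt.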

Step 4: Orchestrator

Tie it all together in run_pipeline.py:

#!/usr/bin/env python3
"""Automated blog content pipeline orchestrator."""

from fetcher import fetch_feeds, FEEDS
from extractor import extract_article
from pipeline import process_article
import os
from datetime import datetime

OUTPUT_DIR = "./generated_posts"
MAX_ARTICLES = 3  # Generate at most 3 posts per run

def generate_frontmatter(post: dict) -> str:
    return f"""---
title: "{post['title']}"
date: {datetime.now().strftime('%Y-%m-%d')}
author: "AI Content Pipeline"
category: "Automated"
tags: [{', '.join(f'"{t}"' for t in post.get('tags', []))}]
cover: "/images/default/{(post.get('target_keyword') or 'blog').replace(' ', '-')}.jpg"
meta_description: "{post['meta_description']}"
---
"""

def main():
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    
    print("🚀 Fetching RSS feeds...")
    items = fetch_feeds(FEEDS)
    print(f"📡 Found {len(items)} articles")
    
    count = 0
    for item in items:
        if count >= MAX_ARTICLES:
            break
        
        print(f"📄 Processing: {item.title[:60]}...")
        content = extract_article(item.url)
        if not content:
            print("  ⏭️  Skipped (no content)")
            continue
        
        result = process_article(item.title, content, item.source)
        if not result:
            print("  ⏭️  Skipped (content too short or LLM failed)")
            continue
        result["source_url"] = item.url  # process_article leaves this unset
        
        # Build a filesystem-safe slug from the title
        safe_slug = item.title.lower()[:60]
        safe_slug = "".join(c if c.isalnum() or c in " -" else "" for c in safe_slug)
        safe_slug = safe_slug.replace(" ", "-").strip("-")
        filename = f"{datetime.now().strftime('%Y%m%d')}-{safe_slug}.md"
        
        with open(os.path.join(OUTPUT_DIR, filename), "w", encoding="utf-8") as f:
            f.write(generate_frontmatter(result))
            f.write(result['body'])
        
        print(f"  ✅ Written to generated_posts/{filename}")
        count += 1
    
    print(f"\n🎉 Done! Generated {count} blog post(s) in {OUTPUT_DIR}/")

if __name__ == "__main__":
    main()
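The slug built in main() is safe for filenames but not unique: two articles whose titles share the same first 60 characters on the same day would overwrite each other. One way around that is to append a short hash of the article URL. The sketch below is an assumption-laden alternative; slugify and build_filename are hypothetical names, and non-ASCII characters are reduced to their closest ASCII form.

```python
import hashlib
import re
import unicodedata


def slugify(title: str, max_len: int = 60) -> str:
    """Lowercase ASCII slug containing only letters, digits, and single hyphens."""
    # Decompose accented characters, then drop whatever is not ASCII
    ascii_title = (
        unicodedata.normalize("NFKD", title).encode("ascii", "ignore").decode()
    )
    slug = re.sub(r"[^a-z0-9]+", "-", ascii_title.lower()).strip("-")
    return slug[:max_len].rstrip("-") or "untitled"


def build_filename(title: str, url: str, date_str: str) -> str:
    """Combine date, slug, and an 8-character URL hash into a Markdown filename."""
    suffix = hashlib.md5(url.encode()).hexdigest()[:8]
    return f"{date_str}-{slugify(title)}-{suffix}.md"


if __name__ == "__main__":
    print(build_filename("OpenAI Announces GPT-5 Fine-Tuning!", "https://example.com/a", "20260530"))
```

The hash suffix makes filenames slightly longer, in exchange for never silently replacing an earlier draft.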

Step 5: Run and Review

# One-shot run
python run_pipeline.py

# Expected output:
# 🚀 Fetching RSS feeds...
# 📡 Found 18 articles
# 📄 Processing: OpenAI Announces GPT-5 Fine-Tuning...
#   ✅ Written to generated_posts/20260530-openai-announces-gpt-5-fine-tuning.md

Each generated post looks like (abbreviated):

---
title: "OpenAI GPT-5 Fine-Tuning Now Available for All Developers"
date: 2026-05-30
author: "AI Content Pipeline"
category: "Automated"
tags: ["openai", "gpt-5", "fine-tuning"]
cover: "/images/default/gpt-5-fine-tuning.jpg"
meta_description: "GPT-5 fine-tuning is now open to all developers. Here's what changed and how to use it."
---

## GPT-5 Fine-Tuning: What Changed

OpenAI just opened GPT-5 fine-tuning to all paid-tier developers. Previously limited to enterprise customers, this change means startups and individual developers can now customize the model...

## Why This Matters Now

The timing isn't accidental. With Llama 4 and DeepSeek V4 offering competitive open-weight models, OpenAI needs to keep developers on its platform. Fine-tuning access is the carrot...
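The frontmatter template interpolates the title and description straight into double-quoted YAML strings, so a hook containing a double quote or a newline would produce invalid frontmatter. A minimal sketch of a safer renderer follows. It relies on the fact that a JSON string literal is also a valid YAML double-quoted scalar; render_frontmatter and yaml_str are hypothetical names, and only a subset of the fields is shown.

```python
import json
from datetime import date


def yaml_str(value: object) -> str:
    """Quote a value as a YAML double-quoted scalar by way of JSON escaping."""
    return json.dumps("" if value is None else str(value), ensure_ascii=False)


def render_frontmatter(post: dict, today: date) -> str:
    """Render a frontmatter block with every string value escaped."""
    tags = ", ".join(yaml_str(tag) for tag in post.get("tags", []))
    lines = [
        "---",
        f"title: {yaml_str(post.get('title'))}",
        f"date: {today.isoformat()}",
        f"tags: [{tags}]",
        f"meta_description: {yaml_str(post.get('meta_description'))}",
        "---",
        "",  # Ensures the block ends with a newline before the body
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = {"title": 'Why "fine-tuning" matters', "tags": ["openai"], "meta_description": "A short hook."}
    print(render_frontmatter(sample, date.today()))
```

The python-frontmatter package from the prerequisites is another route if you would rather not build the block by hand.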

Step 6: Production Hardening

Deduplication

Track previously processed articles to avoid duplicates:

import sqlite3

def init_db():
    conn = sqlite3.connect("pipeline_state.db")
    conn.execute("CREATE TABLE IF NOT EXISTS processed (article_id TEXT PRIMARY KEY)")
    return conn

def is_processed(conn: sqlite3.Connection, article_id: str) -> bool:
    return conn.execute("SELECT 1 FROM processed WHERE article_id=?", (article_id,)).fetchone() is not None

def mark_processed(conn: sqlite3.Connection, article_id: str):
    conn.execute("INSERT OR IGNORE INTO processed (article_id) VALUES (?)", (article_id,))
    conn.commit()
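These helpers are not yet called from run_pipeline.py. The sketch below shows one way the loop could use them: skip IDs already in the table, and mark an ID only after its post was written. The run_once function and its handle callback are hypothetical stand-ins for the body of main(), and init_db takes a path argument here (an assumption) so the example can run against an in-memory database.

```python
import sqlite3
from typing import Callable, Iterable


def init_db(path: str = "pipeline_state.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS processed (article_id TEXT PRIMARY KEY)")
    return conn


def is_processed(conn: sqlite3.Connection, article_id: str) -> bool:
    row = conn.execute(
        "SELECT 1 FROM processed WHERE article_id=?", (article_id,)
    ).fetchone()
    return row is not None


def mark_processed(conn: sqlite3.Connection, article_id: str) -> None:
    conn.execute("INSERT OR IGNORE INTO processed (article_id) VALUES (?)", (article_id,))
    conn.commit()


def run_once(
    conn: sqlite3.Connection,
    article_ids: Iterable[str],
    handle: Callable[[str], bool],
    max_articles: int = 3,
) -> int:
    """Process unseen IDs until max_articles succeed; return the success count."""
    written = 0
    for article_id in article_ids:
        if written >= max_articles:
            break
        if is_processed(conn, article_id):
            continue  # Already turned into a post on an earlier run
        if handle(article_id):  # True means the post was written to disk
            mark_processed(conn, article_id)
            written += 1
    return written


if __name__ == "__main__":
    db = init_db(":memory:")
    print(run_once(db, ["a", "b", "c", "d"], handle=lambda _id: True))
```

Marking only on success means an article that failed extraction or the LLM step is retried on the next run. If you would rather not retry, mark it regardless of the outcome.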

Crontab Schedule

Run daily at 6 AM:

0 6 * * * cd /path/to/pipeline && python run_pipeline.py >> pipeline.log 2>&1

Cost Estimate

Component	Cost per Run (3 articles)
GPT-4o-mini (analysis)	~$0.01
GPT-4o-mini (rewrite)	~$0.03
API calls + bandwidth	~$0.001
Total per run	~$0.04
Monthly (30 days)	~$1.20

Switch to GPT-4o for higher quality at ~$0.30/run, or use a local model via Ollama for zero API cost.
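The figures above depend on token counts and on current provider pricing, both of which change. The arithmetic can be sketched as a small calculator so you can plug in your own numbers. Every value in the usage section is a placeholder assumption, not a quoted price or a measured token count; check your provider's pricing page and your own logs.

```python
from dataclasses import dataclass


@dataclass
class CallProfile:
    """Approximate token usage of one LLM call."""
    input_tokens: int
    output_tokens: int


def call_cost(call: CallProfile, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one call, given prices in USD per million tokens."""
    return (
        call.input_tokens * usd_per_m_input + call.output_tokens * usd_per_m_output
    ) / 1_000_000


def run_cost(
    calls_per_article: list[CallProfile],
    articles: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Cost of one pipeline run: every article makes each call once."""
    per_article = sum(
        call_cost(call, usd_per_m_input, usd_per_m_output) for call in calls_per_article
    )
    return articles * per_article


if __name__ == "__main__":
    # Placeholder token counts: replace with numbers from your own runs
    analysis = CallProfile(input_tokens=1_200, output_tokens=300)
    rewrite = CallProfile(input_tokens=600, output_tokens=1_100)
    # Placeholder prices in USD per million tokens
    per_run = run_cost([analysis, rewrite], articles=3, usd_per_m_input=1.0, usd_per_m_output=2.0)
    print(f"per run: ${per_run:.4f}, per 30 days: ${per_run * 30:.2f}")
```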

Customization Ideas

Source filtering: Add keyword whitelists so you only process articles about specific topics
Tone profiles: Create multiple prompt variants (news-style, tutorial-style, opinion)
Image generation: Pipe the article through DALL-E or Stable Diffusion for a cover image
Auto-publish: Connect the output to your CMS API (WordPress, Ghost, Notion)
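As an example of the first idea, source filtering can be sketched as a whole-word keyword match against each item's title and summary. The matches_keywords helper and the sample keywords are assumptions for illustration. An empty whitelist is treated here as "no filtering", which is a design choice rather than a requirement.

```python
import re


def matches_keywords(title: str, summary: str, keywords: list[str]) -> bool:
    """Return True if any keyword appears as a whole word in title or summary."""
    if not keywords:
        return True  # No whitelist configured, so let everything through
    text = f"{title} {summary}".lower()
    for keyword in keywords:
        # Lookarounds instead of \b so keywords ending in punctuation still work
        pattern = rf"(?<!\w){re.escape(keyword.lower())}(?!\w)"
        if re.search(pattern, text):
            return True
    return False


if __name__ == "__main__":
    whitelist = ["llm", "fine-tuning"]  # Hypothetical topics of interest
    print(matches_keywords("New LLM benchmark released", "", whitelist))
```

In the orchestrator this check would sit before the extraction call, so filtered-out articles cost neither a page fetch nor an LLM request.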

Full Pipeline Script

Save all four files (fetcher.py, extractor.py, pipeline.py, run_pipeline.py) in the same directory. Run python run_pipeline.py after installing dependencies.

Conclusion

You’ve built a working automated blog content pipeline in under 200 lines of Python. It fetches real articles, analyzes them with an LLM, and produces SEO-optimized drafts ready for human review. The hardening steps above (deduplication, scheduling, cost tracking) are what move it toward production use.

The key insight: this doesn’t replace writers. It replaces the research and first-draft phase. A good editor can take a pipeline draft and make it excellent in 15 minutes — versus spending 2 hours starting from scratch. Use it to scale your content operation, not to cut corners.