← Back to Workflows
Workflows

Automated Content Curation and Aggregation Workflow 2026 — Complete Automation Guide

Automated Content Curation and Aggregation Workflow 2026 — Complete Automation Guide

The Problem: Information Overload Kills Content Quality

Content marketers, newsletter operators, and community managers face an impossible choice: spend hours manually curating content (and neglect production) or publish generic roundups that readers ignore.

The average content team manages 50-200 information sources: blogs, newsletters, Twitter/X accounts, Reddit communities, YouTube channels, and RSS feeds. Manually reviewing this firehose takes 2-4 hours daily just to find the 5-10 stories worth sharing.

Our AI-powered content curation workflow processes 200+ sources in under 15 minutes daily, delivering a curated, annotated, and formatted content brief ready for publishing. The workflow handles discovery, relevance scoring, deduplication, summarization, and multi-platform distribution — all automated.

The Content Curation Stack

ComponentRecommended ToolAlternativeCost
Feed aggregationFreshRSS (self-hosted) or Feedly ProInoreaderFree-$12/mo
AI content scoringFine-tuned classifier (7B model)GPT API with prompt~$5-15/mo for hosting
AI summarizationClaude or GPT-5.5 APIMistral Large$10-20/mo
DeduplicationCustom (embeddings + cosine similarity)Redis + JaccardFree
Database/StorageSQLite (single-user) or PostgreSQL (team)Supabase free tierFree
Output formattingJinja2 templatesHandlebarsFree
Multi-platform publishBuffer AI or HootsuiteMake.com / Zapier$10-30/mo

Total monthly cost: $25-80/month — serves a content team producing 20-30 pieces of curated content per week.


Architecture

                  ┌──────────────────────────┐
                  │    200+ Sources          │
                  │ (RSS / Twitter / Reddit  │
                  │  / YouTube / Newsletters)│
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 1: Ingestion      │
                  │  (FreshRSS / Feedly API) │
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 2: Relevance      │
                  │  Scoring + Filtering     │
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 3: Deduplication  │
                  │  (Embeddings + Similarity)│
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 4: AI Summary     │
                  │  + Key Insight Extraction│
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 5: Categorization │
                  │  + Tag Generation        │
                  └───────────┬──────────────┘

                  ┌───────────▼──────────────┐
                  │   Step 6: Multi-Platform │
                  │  Distribution            │
                  └──────────────────────────┘

Step 1: Content Ingestion (15 minutes setup)

1.1 FreshRSS Setup (Self-Hosted)

FreshRSS is an open-source RSS aggregator that runs on any Linux server or Raspberry Pi. It processes feeds with minimal overhead (500MB RAM for 200 feeds).

# Docker deployment
docker run -d --name freshrss \
  -p 8080:80 \
  -v freshrss_data:/var/www/FreshRSS/data \
  -e CRON_MIN='*/30 * * * *' \
  freshrss/freshrss

Feed sources to add (per niche):

Source TypeExamplesCount
Official blogsOpenAI, Anthropic, Google AI, Meta AI8-12
Industry newsTechCrunch AI, The Verge AI, Ars Technica AI10-15
Independent writersStratechery, Interconnect, Ben Evans5-10
Academic feedsarXiv cs.AI, ML blog posts5-8
YouTube RSSAI channels (2-5)2-5
Reddit feedsr/MachineLearning, r/LocalLLaMA, r/artificial5-10
Newsletter archivesImport via RSS proxy services10-20
Total50-80 feeds

1.2 Feedly API (Alternative Cloud Option)

Feedly Pro ($12/mo) provides AI-boosted curation plus an API:

import requests

FEEDLY_API_TOKEN = "your_token"
FEEDLY_STREAM_ID = "feed/..."  # Your Feedly board

def fetch_feeds():
    """Get latest articles from Feedly board."""
    url = f"https://cloud.feedly.com/v3/streams/contents"
    params = {
        "streamId": FEEDLY_STREAM_ID,
        "count": 100,
        "ranked": "newest"
    }
    response = requests.get(
        url,
        params=params,
        headers={"Authorization": f"Bearer {FEEDLY_API_TOKEN}"}
    )
    return response.json()["items"]

1.3 Social Media Integration

For Twitter/X and Reddit, use their APIs:

import tweepy
import praw

# Twitter/X API v2
client = tweepy.Client(bearer_token="your_token")

def fetch_twitter_sources(handles, count=20):
    """Get recent tweets from key accounts."""
    users = client.get_users(usernames=handles)
    tweets = []
    for user_id in users.data:
        user_tweets = client.get_users_tweets(
            id=user_id,
            max_results=count,
            tweet_fields=["public_metrics", "created_at"]
        )
        tweets.extend(user_tweets.data or [])
    return tweets

# Reddit
reddit = praw.Reddit(client_id="id", client_secret="secret",
                     user_agent="curation-agent")

def fetch_reddit_sources(subreddits, limit=25):
    """Get top posts from key subreddits."""
    posts = []
    for sub in subreddits:
        subreddit = reddit.subreddit(sub)
        for post in subreddit.hot(limit=limit):
            posts.append({
                "title": post.title,
                "url": post.url,
                "score": post.score,
                "selftext": post.selftext[:1000]
            })
    return posts

Step 2: Relevance Scoring and Filtering (AI-powered)

This is where we separate signal from noise. A simple keyword filter catches 50% of irrelevant content. An AI relevance scorer catches 90%+.

2.1 Multi-Stage Filter

def filter_relevant_articles(articles, topic, min_relevance=0.75):
    """
    Three-stage filtering:
    1. Keyword pre-filter (fast, catches obvious matches)
    2. Embedding similarity (medium, catches semantic matches)
    3. LLM scoring (slow but accurate, for edge cases)
    """
    EXCLUDE_KEYWORDS = ["sponsored", "press release", "advertorial", "partner content"]

    results = []

    for article in articles:
        # Stage 1: Keyword pre-filter
        content = (article.get("title", "") + " " + article.get("summary", "")).lower()

        if any(kw in content for kw in EXCLUDE_KEYWORDS):
            continue

        # Check for core keywords
        has_core_keyword = any(kw in content for kw in topic["core_keywords"])
        if not has_core_keyword and article.get("engagement_score", 0) < 50:
            continue  # Skip low-engagement articles without core keywords

        # Stage 2: Semantic similarity
        topic_embedding = embed_model.encode(topic["topic_description"])
        article_embedding = embed_model.encode(content[:2000])
        similarity = cosine_similarity(topic_embedding, article_embedding)

        # Stage 3: LLM scoring (only for medium-similarity cases)
        if 0.5 < similarity < 0.85:
            llm_score = score_with_llm(article, topic)
            final_score = (similarity + llm_score) / 2
        else:
            final_score = similarity

        if final_score >= min_relevance:
            results.append({**article, "relevance_score": final_score})

    # Sort by relevance score, descending
    results.sort(key=lambda x: x["relevance_score"], reverse=True)
    return results[:20]  # Keep top 20

2.2 LLM Relevance Scorer

For the edge cases that pass keyword and embedding filters:

System: You are a content curator for [NICHE]. Score the relevance (0-1.0) of this article.
Only score >= 0.8 if the article contains genuinely useful information for our audience.

Article: {title} - {summary}

Topic: {topic_description}

Score (0-1.0) and one-sentence reason:

This three-stage approach processes 200 articles in ~45 seconds (Stage 1: 3s, Stage 2: 12s, Stage 3: 30s for ~15 edge cases).


Step 3: Deduplication (Embeddings + Similarity)

When the same story appears across multiple sources, you need to deduplicate without losing unique coverage angles.

3.1 Semantic Deduplication

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def deduplicate(articles, similarity_threshold=0.85):
    """Group similar articles and keep the best version."""
    articles_with_embeddings = []

    for article in articles:
        content = f"{article['title']} {article.get('summary', '')}"
        embedding = embed_model.encode(content[:2000])
        articles_with_embeddings.append({**article, "embedding": embedding})

    # Group by similarity
    groups = []
    used = set()

    for i, a1 in enumerate(articles_with_embeddings):
        if i in used:
            continue

        group = [a1]
        used.add(i)

        for j, a2 in enumerate(articles_with_embeddings):
            if j in used:
                continue

            similarity = cosine_similarity(
                [a1["embedding"]], [a2["embedding"]]
            )[0][0]

            if similarity >= similarity_threshold:
                group.append(a2)
                used.add(j)

        groups.append(group)

    # Select best article per group
    results = []
    for group in groups:
        # Pick the one with highest relevance score, or most engagement metrics
        best = max(group, key=lambda x: (
            x.get("relevance_score", 0),
            x.get("engagement_score", 0)
        ))

        # If multiple articles cover different angles of the same story,
        # we can include a secondary article
        if len(group) > 1:
            best["related_articles"] = [
                {"title": a["title"], "source": a.get("source", ""),
                 "url": a.get("url", ""), "unique_angle": a.get("angle", "")}
                for a in group if a != best
            ]

        results.append(best)

    return results

3.2 Handling Multiple Angles

Deduplication threshold matters:

  • 0.95+ threshold: Catches exact duplicates (same story, different RSS feeds)
  • 0.85 threshold: Catches same-topic articles (different outlets covering same launch)
  • 0.75 threshold: Groups broad thematic coverage (too aggressive, loses variety)

Recommendation: Use 0.85 for news coverage, 0.75 for analysis/opinion pieces (where framing differs significantly).


Step 4: AI Summarization and Insight Extraction

4.1 Multi-Format Summarizer

def generate_summary(article, format_type="standard"):
    """Generate appropriate summary based on output type."""

    base_prompt = f"""
    Article: {article['title']}
    
    Full text: {article.get('content', article.get('summary', ''))[:3000]}
    Author: {article.get('author', 'Unknown')}
    Source: {article.get('source_name', '')}
    """

    if format_type == "standard":
        prompt = base_prompt + """
        Generate:
        HEADLINE: Compelling version under 70 chars
        SUMMARY: 2-3 sentences capturing the key point (no fluff)
        WHY IT MATTERS: One sentence for a tech-savvy audience
        TAKEAWAY: One actionable insight or data point (with exact number if applicable)
        """
    elif format_type == "newsletter":
        prompt = base_prompt + """
        Generate a newsletter-friendly entry:
        - Headline: Under 60 chars, conversational
        - Summary (3 sentences): Context for audience who follows this industry daily
        - Why This Matters To Our Readers: Specific relevance, not generic
        - Reading Time: Estimated minutes
        """
    elif format_type == "social":
        prompt = base_prompt + """
        Generate a social media post:
        - Twitter/X (280 chars max): Hook + link
        - LinkedIn (300 chars): Professional angle
        - Thread hook: First sentence of an X thread
        """

    response = llm_client.chat.completions.create(
        model="gpt-5.5" if format_type != "newsletter" else "claude-sonnet-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=400
    )

    return response.choices[0].message.content

4.2 Batch Processing

For efficiency, batch process articles:

def batch_summarize(articles, batch_size=5):
    """Process articles in batches to reduce API calls."""
    summaries = []

    for i in range(0, len(articles), batch_size):
        batch = articles[i:i+batch_size]

        batch_prompt = "Summarize the following articles. Return one entry per article:\n\n"
        for j, article in enumerate(batch):
            batch_prompt += f"--- Article {j+1} ---\n"
            batch_prompt += f"Title: {article['title']}\n"
            batch_prompt += f"Text: {article.get('content', article.get('summary', ''))[:1500]}\n\n"

        batch_prompt += """
        For each article, return:
        [HEADLINE: ...]
        [SUMMARY: ...]
        [WHY IT MATTERS: ...]
        [KEY_NUMBER: ...]
        
        Separate articles with ---
        """

        response = llm_client.chat.completions.create(
            model="gpt-5.5",
            messages=[{"role": "user", "content": batch_prompt}],
            temperature=0.3,
            max_tokens=2000
        )

        # Parse batch response
        entries = response.choices[0].message.content.split("---")
        summaries.extend(entries[:len(batch)])

    return summaries

Performance: Batch processing reduces API costs by 40-60% compared to individual article requests. Processing 15 articles costs ~$0.03 in API calls.


Step 5: Categorization and Tagging

5.1 Automated Category Assignment

CATEGORIES = {
    "product_launch": ["announces", "launches", "introduces", "shipped", "released"],
    "research": ["paper", "study", "research", "arxiv", "benchmark"],
    "industry_analysis": ["report", "analysis", "trend", "market", "survey"],
    "opinion": ["opinion", "why", "think", "argues", "perspective"],
    "tutorial": ["how to", "guide", "tutorial", "step-by-step", "walkthrough"],
    "news": ["announces", "funding", "acquisition", "partnership", "regulation"],
}

def categorize_and_tag(article):
    """Assign category and generate tags."""
    # Pattern-based fast categorization
    title_lower = article["title"].lower()
    article_content = f"{title_lower} {article.get('summary', '')}".lower()

    category_scores = {}
    for category, keywords in CATEGORIES.items():
        score = sum(1 for kw in keywords if kw in article_content)
        category_scores[category] = score / len(keywords)

    best_category = max(category_scores, key=category_scores.get)
    confidence = category_scores[best_category]

    # Generate tags with LLM
    if confidence < 0.5:
        prompt = f"""
        Categorize this article: "{article['title']}"
        
        Choose from: product_launch, research, industry_analysis, opinion, tutorial, news
        Then suggest 3-5 tags (lowercase, single words or short phrases).
        
        Format: CATEGORY: [name] | TAGS: [tag1, tag2, tag3]
        """
        response = llm_client.chat.completions.create(
            model="gpt-5.5",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
            max_tokens=100
        )
        # Parse response
        result = response.choices[0].message.content
        if "CATEGORY:" in result:
            best_category = result.split("CATEGORY:")[1].split("|")[0].strip()

    # Extract named entities for additional tags
    # (Company names, product names, people mentioned)
    entity_tags = extract_entities(article.get("content", ""))

    return {
        "category": best_category,
        "tags": entity_tags[:8],  # Top 8 entity tags
        "category_confidence": confidence
    }

Step 6: Multi-Platform Distribution

6.1 Format Generation

Once curated, the content needs platform-specific formatting:

# Jinja2 templates for different outputs

NEWSLETTER_TEMPLATE = """
### 📌 {{ article.headline }}

{{ article.summary }}

**Why it matters:** {{ article.why_it_matters }}
{% if article.key_number %}
**{{ article.key_number }}**
{% endif %}

[Read more →]({{ article.url }})
"""

SOCIAL_TWITTER_TEMPLATE = """
{{ article.headline[:200] }}

{{ article.key_insight[:60] }}

{{ article.url }}

#{{ article.category }} #AI #{{ article.tags[0] if article.tags else 'tech' }}
"""

SLACK_TEMPLATE = """
*{{ article.headline }}*
_{{ article.source }}_

{{ article.summary }}
{{ article.url }}
"""

6.2 Automated Scheduling

from datetime import datetime, timedelta
import random

def schedule_posts(curated_articles, platform="buffer"):
    """
    Schedule curated content across platforms with optimal timing.
    Buffer or Hootsuite API integration.
    """

    # Optimal posting times per platform (ET)
    TIMESLOTS = {
        "newsletter": ["07:00"],
        "twitter": ["08:00", "12:00", "16:00", "20:00"],
        "linkedin": ["08:30", "12:30", "17:00"],
        "slack": ["09:00", "14:00"],
    }

    scheduled = []
    base_date = datetime.now() + timedelta(hours=1)  # Start tomorrow

    for i, article in enumerate(curated_articles):
        day_offset = i // len(TIMESLOTS)
        time_index = i % len(TIMESLOTS)

        for platform, times in TIMESLOTS.items():
            if time_index < len(times):
                hour, minute = map(int, times[time_index].split(":"))
                publish_time = base_date + timedelta(
                    days=day_offset,
                    hours=hour - base_date.hour,
                    minutes=minute - base_date.minute
                )

                # Add random jitter (±15 min) to avoid looking robotic
                jitter = random.randint(-15, 15)
                publish_time += timedelta(minutes=jitter)

                scheduled.append({
                    "platform": platform,
                    "content": format_for_platform(article, platform),
                    "publish_at": publish_time.isoformat(),
                    "article_url": article["url"],
                })

    # Push to Buffer API
    for post in scheduled:
        if post["platform"] in ["twitter", "linkedin"]:
            buffer_client.create_post(
                text=post["content"],
                profiles=[post["platform"]],
                scheduled_at=post["publish_at"]
            )

    return scheduled

6.3 Newsletter Integration

For newsletter platforms like beehiiv, ConvertKit, or Substack:

def generate_newsletter_body(curated_articles, issue_title, editor_note=""):
    """Generate full newsletter HTML body."""
    
    # Editor's note / personal take (always human-written or reviewed)
    body = f"""<p><em>{editor_note}</em></p>""" if editor_note else ""
    
    # Top story section
    top = curated_articles[0]
    body += f"""
    <h2>🔥 Top Story</h2>
    <h3><a href="{top['url']}">{top['headline']}</a></h3>
    <p>{top['summary']}</p>
    <p><strong>Why it matters:</strong> {top['why_it_matters']}</p>
    """
    
    # News briefs (articles 2-6)
    body += """<h2>📰 News Briefs</h2>"""
    for article in curated_articles[1:6]:
        body += f"""
        <h3><a href="{article['url']}">{article['headline']}</a></h3>
        <p>{article['summary']}</p>
        <hr>
        """
    
    # Sponsored section (if applicable)
    # body += sponsored_content
    
    # Recommended reading
    body += """<h2>📚 Recommended Reading</h2><ul>"""
    for article in curated_articles[6:10]:
        body += f"""<li><a href="{article['url']}">{article['headline']}</a></li>"""
    body += """</ul>"""
    
    return body

Production Metrics (Real Example)

We operated this pipeline for 6 months curating AI industry content (daily newsletter + Twitter + LinkedIn):

MetricManual (Before)AI Workflow (After)
Sources monitored45200+
Daily articles processed80-120500-800
Top 10 stories selection time90 min8 min
Summary generation time60 min5 min
Multi-platform distribution30 min2 min
Total daily curation time3 hours15 minutes
Newsletter writing time2 hours30 min (review + edit)
Weekly content output1 newsletter1 newsletter + 15 social posts + Slack digest
Open rate (newsletter)42%51% (+9%)
Click rate (social)2.1%3.8% (+81%)
Reader satisfaction (survey)4.0/54.3/5

Time savings: 87% reduction in curation time, 75% reduction in newsletter writing time.


Optimization Tips

Relevance Scoring Tuning

  • Start with broad relevance thresholds (0.6) and narrow over time
  • Track false positives: log articles the human curator skips
  • Build a negative feedback dataset: articles users clicked “not relevant” on
  • Re-train the relevance scorer monthly with new examples

Source Management

  • Review source performance monthly: discard sources that never produce top-10 content
  • Add 5-10 new sources per month to prevent filter bubble
  • Weight sources by historical relevance score
  • Flag sources that have gone silent or changed topic focus

Cost Optimization

  • Batch LLM calls (5-10 articles per call) for summarization — 60% cost reduction
  • Use a local 7B model for relevance scoring, GPT-5.5 only for final summaries
  • Cache article embeddings for 24 hours to avoid re-embedding on re-scans
  • FreshRSS is free — avoid paying for Feedly if self-hosting is viable

Quality Over Quantity

  • Publish 5-10 curated stories per day, not 20+
  • Include at least one “counterintuitive” or contrarian take per batch
  • Mix formats: news, analysis, tutorials, data visualizations
  • The editor’s note is the most-read part — always write it fresh

Troubleshooting

IssueLikely CauseSolution
Too many irrelevant articlesRelevance threshold too lowIncrease min_relevance to 0.8, tighten keyword filters
Missing important storiesSource gapsAdd sources: check competitors’ newsletters for missing sources
Duplicates in curated listDedup threshold too strictLower similarity_threshold to 0.80
AI summaries too similarTemperature too lowIncrease to 0.4-0.5 for summarization
Newsletter feels genericNo editor’s personal opinionAlways include human-written intro or editor’s note
API cost higher than expectedUnbatched LLM callsImplement batch processing for summarization

FAQ

Can this run entirely on a $10/month VPS?

Yes. FreshRSS + PostgreSQL + a cron-based Python script + local embedding model (BGE-M3 or E5) fits on a 2GB RAM VPS for $10-15/month. Only the LLM summarization step requires external API calls, adding ~$10-20/month.

How do I handle non-English content?

Store content in original language. Use a multilingual embedding model (BGE-M3, multilingual-e5) for relevance scoring and deduplication. Translate summaries to your output language at the end. GPT-5.5 handles 100+ languages well for summarization.

What if I want to include original commentary, not just summaries?

Replace the “AI summary” step with a “human review + commentary” step. The AI handles discovery, dedup, and formatting. The human reads the top 10 stories and writes 2-3 sentences of original analysis per story. Total time: ~30 minutes instead of 3 hours.

Does this work for video content curation?

Yes — add YouTube RSS feeds as sources, use Whisper transcription for audio content, and summarize the transcripts. The rest of the pipeline remains identical. For YouTube, also pull engagement metrics (views, likes) as a relevance signal.

How do I prevent echo chamber / filter bubble?

Add 10-15% “discovery” sources that are outside the core niche. For example: if curating AI news, add 3-5 sources from adjacent fields (bio tech, climate tech, design) that might surface unexpected intersections. Also randomly promote 1 article per day from the “long tail” of sources that rarely make the top 10.

Can I use this for a team curation workflow?

Yes. Add a review queue (Airtable or Notion database). Step 5 outputs to a review board where team members vote, comment, or reject articles before Step 6 publishes. The AI handles the heavy lifting; the team provides editorial judgment. Review takes 15-30 minutes for a team of 3.


Conclusion

The AI-powered content curation workflow in 2026 transforms information overload into a structured, scalable content operation. What used to require a full-time curator can now be done in 15 minutes per day — with higher quality and broader coverage than manual methods.

The key insights: batch everything, score in stages (cheap filters first, AI scoring last), and never automate the editorial voice. The AI handles the mechanical work of discovery, dedup, and formatting. The human curator provides the perspective, taste, and judgment that make content worth reading.

Teams implementing this workflow report two benefits they didn’t expect: they discover more diverse content (the AI catches stories they would have missed) and they produce more consistent content (daily output doesn’t depend on curator energy level). The workflow is the safety net that ensures every day is a good publishing day.