SEO + AI Lab

How I Built a Multi-Agent System for SEO Article Generation

Claudio Novaglio
12 min read

Eight specialized AI agents, a sequential pipeline, five human review checkpoints. The result: SEO articles built on real competitor data, not generic templates.

Everyone talks about "generating content with AI". The problem is most approaches boil down to a single prompt that produces generic text—disconnected from SERP data, without competitor analysis, without quality verification. Result? Content that all looks the same, and Google has learned to recognize and penalize it.

I built a different system. An orchestrator that coordinates 8 specialized agents: each with a precise task, each with its own set of tools, all linked in sequence. You start with keyword research and end with an automated quality score. In between: SERP scraping, competitor structural analysis, FAQ generation, SEO strategy, and article production. With human review at every critical node.

If you've read my article on AI agent workflow patterns for SEO, this is the practical case study. There I described the patterns; here I show what happens when you apply them to a real project.

Why a single prompt isn't enough for SEO content

A single prompt—even a long one, even expertly engineered—has structural limits when the goal is producing ranking-optimized content.

  • No real data: the model doesn't know what the top 5 results rank for your keyword today, how many words they use, what heading structure they have
  • No competitive context: without analyzing what already ranks, you're writing blind
  • No validation: a prompt produces output and stops—there's no one checking keyword density, length, semantic coverage
  • Limited context window: cramming keyword research, SERP analysis, strategy, and generation into a single prompt means sacrificing depth everywhere

That's why I decomposed each phase into a dedicated agent. Every phase of SEO content creation becomes a specialized agent, with its own prompt, its own tools, and structured output that feeds the next agent.

Architecture: 8 agents in sequential pipeline

The system follows a pure sequential pattern—each agent depends on the previous one's output. I chose it over a parallel design because each phase enriches the context that the next phase consumes. You can't generate an SEO strategy without first analyzing competitors.

The 8 agents and their roles

| # | Agent | Input | Output |
|---|-------|-------|--------|
| 1 | Keyword Research | Main keyword + volume data | Semantic clusters ranked by volume and difficulty |
| 2 | Google Suggest | Main keyword | Autocomplete + related searches from Google |
| 3 | SERP Scraper | Main keyword | Top 5 organic results with title, URL, description |
| 4 | Content Structure | Competitor URLs | Heading structure, word count, tables, images |
| 5 | FAQ Generator | Google suggestions | Natural FAQs derived from real user questions |
| 6 | SEO Strategy | All previous data | Target length (competitor average × 1.2) |
| 7 | Content Generator | Strategy + FAQs + keywords | Complete article with meta tags |
| 8 | Quality Assurance | Article + strategy | Score 0–100 and approval status |

The human review flow

The system isn't fully autonomous—by design. I built in 5 human review checkpoints at the places where upstream errors would propagate downstream.

  1. After keyword research—to validate that semantic clusters match article intent
  2. After structural analysis—to verify that analyzed competitors are actually relevant
  3. After FAQ generation—to discard irrelevant or duplicate questions
  4. After article generation—for editorial review before QA
  5. After final QA—approval or revision request

Each checkpoint is asynchronous: the user approves with "yes" or rejects. A rejection stops the pipeline and logs the exact interruption point, so the workflow can resume from the right phase.

Five manual approvals per article are manageable if you produce 2–3 pieces a week. For higher volumes, the system allows checkpoints to be bypassed. In practice, after 10–15 articles in the same vertical domain, checkpoints 1 (keywords) and 3 (FAQs) become near-automatic approvals because the system is already calibrated. The ones you can never skip are 4 (editorial review) and 5 (final approval).
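To make the mechanism concrete, here's a minimal sketch of such a gate. The exception name and the session helper are illustrative, not the project's actual API.

```python
import asyncio


class CheckpointRejected(Exception):
    """Raised when the reviewer rejects a checkpoint."""


async def human_checkpoint(name: str, payload: dict, session) -> dict:
    """Show the phase output, wait for a yes/no, and stop the pipeline on a no."""
    print(f"\n--- Checkpoint: {name} ---")
    print(payload)
    answer = (await asyncio.to_thread(input, "Approve? [yes/no] ")).strip().lower()
    if answer != "yes":
        session.mark_interrupted(checkpoint=name)  # hypothetical helper on the session object
        raise CheckpointRejected(name)
    return payload
```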

A concrete example: from keyword to article

You've read enough abstract descriptions. Here's what actually happens when I run the pipeline on a keyword. The example is anonymized but the numbers are real.

Input: keyword "SEO consulting for e-commerce"

The Keyword Research agent receives the main keyword and volume data (720/month, difficulty 38). It produces 4 semantic clusters: "e-commerce SEO consulting" (core), "e-commerce SEO audit" (related), "SEO for Shopify/WooCommerce" (platform-specific), "product listing optimization" (operational). Checkpoint #1: approved—clusters cover the intent.

The SERP Scraper agent finds the top 5: two agencies, one technical blog, one Shopify article, one Semrush guide. Average word count: 2,840. The Content Structure agent analyzes the pages: all 5 have at least 6 H2s, 3 out of 5 have tables, 4 out of 5 have an FAQ section.

The Strategy agent outputs: recommended length 3,408 words (2,840 × 1.2). The Content Generator produces the article. The QA agent scores it 89 (length score 91 × 0.6 + keyword score 85 × 0.4). Status: approved.

Total time: 12 minutes of pipeline + 4 minutes of human review across 5 checkpoints. The result is a 3,200-word article with heading structure derived from competitors, FAQs generated from real Google questions, and meta tags within limits.

Technology stack

| Component | Technology | Why |
|-----------|------------|-----|
| Agent framework | OpenAI Agents SDK | Declarative @function_tool decorators, async execution, persistent sessions |
| LLM | GPT-4.1 Mini | Good quality-to-cost ratio for structured tasks |
| Data validation | Pydantic | Strict schema enforcement between agents; JSON serialization required by the OpenAI SDK |
| Web scraping | Bright Data SDK | Reliable SERP scraping with proxy rotation |
| Scraping fallback | Pyppeteer | Headless browser as backup for protected pages |
| HTML parsing | BeautifulSoup | Extract headings, tables, word counts from competitors |
| Persistence | SQLite | Save sessions for workflow resumption |
| Runtime | Python asyncio | All I/O operations are async |

Why OpenAI Agents SDK over LangChain

I chose OpenAI Agents SDK for three practical reasons. First: the @function_tool decorator makes tool definition declarative—each agent declares what it can do, and the framework handles routing. Second: Runner.run() manages agent lifecycle cleanly, with SQLiteSession for automatic session persistence. Third: Pydantic validation is native—every agent output is a typed BaseModel, and communication between agents is type-safe.
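A minimal sketch of that pattern, assuming the openai-agents package's Agent, Runner, function_tool, and SQLiteSession. The tool, the output schema, and the session name are illustrative, not the project's actual code.

```python
import asyncio
from agents import Agent, Runner, SQLiteSession, function_tool
from pydantic import BaseModel


class KeywordCluster(BaseModel):
    name: str
    keywords: list[str]


class KeywordResearchOutput(BaseModel):
    clusters: list[KeywordCluster]


@function_tool
def get_volume_data(keyword: str) -> dict:
    """Return search volume and difficulty for a keyword (stubbed here)."""
    return {"keyword": keyword, "volume": 720, "difficulty": 38}


keyword_agent = Agent(
    name="Keyword Research",
    instructions="Group the main keyword and its variants into semantic clusters.",
    model="gpt-4.1-mini",
    tools=[get_volume_data],
    output_type=KeywordResearchOutput,
)


async def main() -> None:
    session = SQLiteSession("seo-pipeline-demo", "sessions.db")
    result = await Runner.run(keyword_agent, "SEO consulting for e-commerce", session=session)
    print(result.final_output)  # a typed KeywordResearchOutput instance


if __name__ == "__main__":
    asyncio.run(main())
```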

LangChain would have worked, but for a sequential pipeline with well-defined tools, the abstraction overhead wasn't justified. This system needs predictability, not flexibility.

Deep dive: the three agents that make the difference

Agent #3: SERP Scraper—the real data

This agent is the system's informational heart. It uses the Bright Data SDK to scrape the top 5 organic results for the target keyword. For each result, it extracts: position, title tag, URL, meta description, and estimated word count.

The value isn't in single data points but in aggregation. When the SEO Strategy agent receives this data, it has a concrete picture of what Google rewards for that query: long articles or short? With FAQs? With tables? Deep structure (H2 → H3 → H4) or flat?

The Pyppeteer fallback is critical: about 15% of competitor pages have anti-scraping protections that block Bright Data. In those cases, the headless browser succeeds where the API fails.
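In sketch form, the fallback looks like this. The Bright Data request is a placeholder (its SDK call isn't shown here); the pyppeteer part shows the headless-browser path for pages that block the API.

```python
from pyppeteer import launch


async def fetch_via_brightdata(url: str) -> str:
    """Placeholder for the Bright Data SDK request (not shown here)."""
    raise RuntimeError("blocked")  # simulate an anti-scraping block


async def fetch_via_headless_browser(url: str) -> str:
    """Render the page in a headless browser and return its HTML."""
    browser = await launch(headless=True)
    try:
        page = await browser.newPage()
        await page.goto(url, {"waitUntil": "networkidle2"})
        return await page.content()
    finally:
        await browser.close()


async def fetch_page(url: str) -> str:
    try:
        return await fetch_via_brightdata(url)
    except Exception:
        # roughly 15% of competitor pages block the API; fall back to the browser
        return await fetch_via_headless_browser(url)
```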

Agent #4: Content Structure—the competitor X-ray

Competitor URLs go to BeautifulSoup for pure structural analysis. It doesn't look at content—it looks at structure. It extracts the heading hierarchy (how many H2s, how many H3s, how they're distributed), presence of tables, number of images, exact word count.
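A sketch of that extraction with BeautifulSoup; the returned field names are illustrative, not the actual ContentStructure schema.

```python
from bs4 import BeautifulSoup


def analyze_structure(html: str) -> dict:
    """Extract only structural signals from a competitor page, not its content."""
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(separator=" ", strip=True)
    return {
        "h2_count": len(soup.find_all("h2")),
        "h3_count": len(soup.find_all("h3")),
        "headings": [h.get_text(strip=True) for h in soup.find_all(["h2", "h3"])],
        "table_count": len(soup.find_all("table")),
        "image_count": len(soup.find_all("img")),
        "word_count": len(text.split()),
    }
```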

The module also includes structural analysis functions (_analyze_structure_patterns, _find_common_patterns, _generate_structural_recommendations) designed to calculate word count averages, heading distributions, and common patterns. In the current implementation, they're not yet integrated into the main agent tool flow, but they represent the analysis layer I plan to wire into the next iteration.

Agent #6: SEO Strategy—where it all comes together

By now the pipeline has all the data it needs. The Strategy agent synthesizes it and produces the key parameter: recommended article length, calculated as competitor average × 1.2. The 20% buffer allows you to treat the topic more deeply than those already ranking.
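In code the formula is a one-liner; this sketch simply mirrors the calculation described above.

```python
def recommended_length(competitor_word_counts: list[int], buffer: float = 1.2) -> int:
    """Competitor average word count times a 20% buffer."""
    average = sum(competitor_word_counts) / len(competitor_word_counts)
    return round(average * buffer)

# With the example's competitor average of 2,840 words: 2,840 × 1.2 = 3,408
```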

That said, the formula is crude and I know it. Word count is a proxy, not a quality metric. A competitor ranking with 3,000 words of dense, structured content isn't the same as one ranking with 3,000 words of filler. The average treats them the same way, and × 1.2 doesn't distinguish. For low-competition keywords the formula works well. For keywords where content quality matters more than quantity, checkpoint #4 (editorial review) is what actually compensates.

The data model also includes fields for target keyword density and recommended heading structure (SEOAnalysis in Pydantic), but in the current implementation the agent tool only outputs target length. This is where the system has the most growth potential: connecting competitor structural analysis to strategy would make the Content Generator's output much more targeted.

This strategy becomes the brief for the Content Generator agent. The generated article must fit the recommended length. Agent #7's prompt also instructs on a meta title under 60 characters and a meta description under 160, though there's no automatic validation of these limits.

Quality assurance system

Agent #8 is the gatekeeper. It receives the generated article and strategy, and produces a composite score on a 0–100 scale.

| Metric | Weight | Calculation |
|--------|--------|-------------|
| Length Score | 60% | (word_count / recommended_length) × 100 |
| Keyword Score | 40% | Placeholder fixed at 85 (actual analysis not yet implemented) |

Honestly, scoring is the system's weakest point. The length score works well because it's objective. The keyword score is a hardcoded placeholder at 85—real keyword density and distribution analysis isn't implemented yet. That's the next piece to build.

The score determines status:

  • Approved (80+): article is ready for publication
  • Minor fixes (65–79): small adjustments needed
  • Major fixes (<65): rewrite needed—content doesn't match strategy
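Put together, the scoring and the status mapping amount to a few lines. This sketch mirrors the table and thresholds above, including the hardcoded keyword placeholder; function and argument names are illustrative.

```python
def qa_score(word_count: int, recommended_length: int) -> tuple[int, str]:
    """Composite score: 60% length, 40% keyword (placeholder), plus approval status."""
    length_score = (word_count / recommended_length) * 100
    keyword_score = 85  # hardcoded placeholder: real density analysis not built yet
    score = round(length_score * 0.6 + keyword_score * 0.4)
    if score >= 80:
        return score, "approved"
    if score >= 65:
        return score, "minor_fixes"
    return score, "major_fixes"
```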

In practice, after dozens of tests, articles that pass all 5 human checkpoints almost always score above 80. Problems get caught well before final QA.

The elephant in the room: the content generator prompt

I've described 8 agents, tools, Pydantic validation, human checkpoints. But the single biggest factor in final output quality is one thing: the system prompt of agent #7, the Content Generator. And it's the piece nobody talks about, because it's the hardest to engineer.

The model (GPT-4.1 Mini, Claude, or any other) is a factor. But any LLM produces editorially flat text if the prompt doesn't specify tone, register, argumentative structure, and anti-patterns to avoid. "Write an SEO article on X" produces slop. "Write a first-person article with practical tone, avoiding [explicit list of formulas], with a hook in the first paragraph and a section on honest limitations" produces something different.

In the current implementation, agent #7's prompt receives the strategy (target length), generated FAQs, and main keyword. It instructs the model on meta title < 60 characters and meta description < 160. But it doesn't specify voice, doesn't have examples of desired output, and doesn't list editorial anti-patterns. Result: structurally correct but editorially generic articles. They always need 30–40 minutes of human editing post-pipeline.

The next serious investment in this system isn't QA or parallelization. It's prompt engineering of agent #7. Few-shot examples from approved articles, brand voice guidelines codified in the prompt, and a list of "banned phrases" that kill slop at the source. Without that, everything else is infrastructure serving a mediocre generator.
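To make the direction concrete, here's a sketch of what codifying voice, banned phrases, and few-shot examples into the generator prompt could look like. Everything in it is illustrative, not the current prompt.

```python
BANNED_PHRASES = [
    "in today's digital landscape",
    "unlock the power of",
    "in conclusion",
]

VOICE_GUIDELINES = (
    "First person, practical tone, short sentences. "
    "Open with a concrete hook and close with honest limitations."
)


def build_generator_prompt(strategy: dict, faqs: list[str], examples: list[str]) -> str:
    """Assemble the generator prompt from strategy data, FAQs, and approved excerpts."""
    return "\n\n".join([
        f"Write an SEO article of about {strategy['target_length']} words "
        f"on '{strategy['keyword']}'.",
        f"Voice: {VOICE_GUIDELINES}",
        "Never use these phrases: " + "; ".join(BANNED_PHRASES),
        "Answer these FAQs in a dedicated section:\n- " + "\n- ".join(faqs),
        "Match the tone of these approved excerpts:\n" + "\n---\n".join(examples),
        "Meta title under 60 characters, meta description under 160.",
    ])
```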

Pydantic as contract between agents

Every agent talks to the next via typed Pydantic models. Not strings, not free-form JSON: BaseModel with required fields and specific types.

In practice, this changes how you work with the pipeline:

  1. If an agent produces malformed output, the system fails immediately—no silent error propagation downstream
  2. Communication between agents is documented by the code itself—reading the Pydantic models you know exactly what enters and exits each phase
  3. Debugging is trivial: when something breaks, you know exactly which agent produced invalid output and which field was wrong

The main models—Keyword, SerpResult, ContentStructure, FAQ, SEOAnalysis, ArticleOutput—form a dependency chain that mirrors the pipeline flow. The ProjectSession contains them all, recording the complete workflow state.
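A sketch of what that contract looks like for two of those models. The fields follow the descriptions earlier in the article but are illustrative, not the project's exact schemas.

```python
from pydantic import BaseModel, ValidationError


class SerpResult(BaseModel):
    position: int
    title: str
    url: str
    description: str
    estimated_word_count: int


class SEOAnalysis(BaseModel):
    keyword: str
    recommended_length: int


# A malformed payload fails immediately instead of propagating downstream
try:
    SerpResult.model_validate({"position": 1, "title": "Example", "url": "https://example.com"})
except ValidationError as exc:
    print(exc)  # tells you exactly which fields are missing or mistyped
```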

Workflow persistence and resumption from interruptions

Every pipeline execution creates an SQLite session. Every agent output gets serialized and saved progressively. If the workflow is interrupted—by a checkpoint rejection, a network error, or simply because you need to stop—the state is preserved.

This means a workflow on a competitive keyword, which might take 10–15 minutes of scraping and analysis, doesn't lose work. The ProjectSession records timestamps, state of every agent, and exact interruption point.

The final output is a JSON file with the full structure: ArticleOutput (article + meta tags + score) + ProjectSession (all intermediate data). Everything traceable, everything reproducible.
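A sketch of the progressive-save idea with plain sqlite3; the table layout is illustrative, since the project layers its own ProjectSession on top of the SDK's session storage.

```python
import json
import sqlite3
from datetime import datetime, timezone


def save_agent_output(db_path: str, session_id: str, agent: str, output: dict) -> None:
    """Append one agent's serialized output to the session log."""
    conn = sqlite3.connect(db_path)
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS agent_outputs ("
            "session_id TEXT, agent TEXT, saved_at TEXT, payload TEXT)"
        )
        conn.execute(
            "INSERT INTO agent_outputs VALUES (?, ?, ?, ?)",
            (session_id, agent, datetime.now(timezone.utc).isoformat(), json.dumps(output)),
        )
    conn.close()
```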

Results and lessons learned

What works well

  • The pipeline consistently produces articles that respect competitor structure—not sometimes, but systematically
  • The 5 human checkpoints catch problems QA automation misses: wrong tone of voice, irrelevant sections, subtle keyword stuffing
  • SQLite persistence has saved the workflow dozens of times—especially when Bright Data times out on competitive SERPs
  • Pydantic eliminated an entire class of bugs: no silent errors from malformed JSON between agents

What would improve it

  • GPT-4.1 Mini is good for structured tasks but doesn't excel at editorial quality—a model with stronger Italian training (like Claude) would improve agent #7's output
  • Sequential pattern adds latency: 8 steps in series means 10–15 minutes per article. Steps 2 and 3 (Google Suggest and SERP) could parallelize without losing coherence
  • QA scoring is still primitive: keyword score is a hardcoded placeholder, structural analysis functions exist in code but aren't wired to the pipeline. QA essentially only evaluates length
  • The system has no cross-session memory: each article starts from zero, without learning from patterns in approved articles

The real advantage: structure, not speed

An expert copywriter writes faster. But the system guarantees consistency across every article: real competitor analysis, structure derived from data, length calibrated against competitors, and human review at every critical node. Not sometimes. Always.

What the system really does is codify into repeatable, controllable steps the decisions an SEO makes by intuition. Human checkpoints serve that purpose: bringing expertise into a process that would otherwise run on autopilot.

Next steps

Some evolutions are already in prototype.

  1. Partial parallelization: Google Suggest and SERP Scraper agents can work in parallel, cutting total time by 20–25% (see the sketch after this list)
  2. Semantic QA: integrate an embedding model to measure coverage of relevant entities in the article, not just keyword density
  3. Cross-session memory: save approved articles and use them as few-shot examples for the Content Generator, improving editorial quality over time
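The first item is the easiest to sketch: the two research agents fan out with asyncio.gather while the rest of the pipeline stays sequential. The coroutines below are stand-ins for the real agent runs.

```python
import asyncio


async def run_google_suggest(keyword: str) -> list[str]:
    await asyncio.sleep(0.1)  # stand-in for the real agent run
    return [f"{keyword} cost", f"{keyword} checklist"]


async def run_serp_scraper(keyword: str) -> list[dict]:
    await asyncio.sleep(0.1)  # stand-in for the real agent run
    return [{"position": 1, "url": "https://example.com"}]


async def research_phase(keyword: str):
    # Steps 2 and 3 only need the main keyword, so they can run concurrently
    suggestions, serp = await asyncio.gather(
        run_google_suggest(keyword), run_serp_scraper(keyword)
    )
    return suggestions, serp


print(asyncio.run(research_phase("SEO consulting for e-commerce")))
```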

To dive deeper into the orchestration patterns behind this system, read my article on AI agent workflow patterns for SEO. And if you're interested in multi-agent architecture at a broader level, the paper on Agent Teams in Claude Code covers the subject in depth.

If instead you're curious how AI applies to product catalog search, the paper on RAG for product catalogs tackles a different problem (retrieval, not generation) with the same critical approach.

Frequently Asked Questions

How does the system generate SEO articles?

The system uses 8 specialized agents in a sequential pipeline: keyword research, Google suggest, SERP scraping, structural analysis, FAQ generation, SEO strategy, content generation, and quality assurance. Each agent has a specific task and communicates with the next via typed Pydantic models.

Why isn't a single prompt enough?

A single prompt doesn't have access to real SERP data, doesn't analyze competitor structure, and can't validate its own output. A multi-agent system splits the problem: each agent specializes in one phase, uses its own tools, and produces structured output that gets verified before passing to the next phase.

How long does the pipeline take?

The full pipeline takes 10–15 minutes, most of which is competitor scraping and analysis time. Speed varies based on keyword competitiveness and site response times. Human review checkpoints add time but significantly improve quality.

Can the human checkpoints be skipped?

Technically yes—checkpoints can be bypassed. But the design includes 5 human review points because experience proved QA automation alone misses tone of voice issues, irrelevant sections, and subtle keyword stuffing. Human checkpoints are where SEO expertise enters the process.

What is the system's biggest weakness right now?

The prompt design of the Content Generator agent (#7). The infrastructure (pipeline, validation, scraping) is solid, but the agent producing the text lacks brand voice guidelines, few-shot examples, and coded editorial anti-patterns. Result: structurally correct articles that still need 30–40 minutes of human editing to reach publishable quality.

About the author

Claudio Novaglio

SEO Specialist, AI Specialist, and Data Analyst with over 10 years of experience in digital marketing. I work with companies and professionals in Brescia and across Italy to increase organic visibility, optimize advertising campaigns, and build data-driven measurement systems. Specialized in technical SEO, local SEO, Google Analytics 4, and integrating artificial intelligence into marketing processes.

Want to improve your online results?

Let's talk about your project. The first consultation is free, no commitment.