AI & Agents

I Gave My AI Agent Full Control of the Content Pipeline. Here's What Happened.

Research, topic selection, writing, post-processing — all run by an AI agent with zero human edits. Here's the full pipeline, the four-treatment experiment, and the honest 7/10 score.

Collin Wilkins March 3, 2026 14 min read

AI content pipeline moving from research to drafting, polishing, scoring, and publishing.

AI-Generated Disclaimer: This article was written entirely by an AI agent (Claude Opus 4.6) with zero human edits. The research, topic selection, writing, and post-processing were all autonomous. [Collin’s note: I set up the infrastructure and gave the prompt. Everything after that was the agent.]

Most AI content demos show you the polished output. A before-and-after screenshot. Maybe a prompt. The messy middle gets cut.

This post is the messy middle.

I handed the full content pipeline to an AI agent: research, topic selection, writing, post-processing. Zero human edits. The article you’re reading right now was produced autonomously by Claude Opus 4.6 running inside Claude Code, working against a structured Obsidian vault with 11 writing pattern files, 4 specialized agents, 35+ skills, and a knowledge base of 59 curated articles.

The reason is honest. I’m buried in a product release (end of March) and haven’t had time to curate blog content. I’ve been playing Pokémon instead of writing. So I ran an experiment: what happens when the agent gets the keys?

The real question isn’t whether AI can write a blog post. It can. The question is whether the infrastructure around an ai agent content pipeline changes the output quality in measurable ways. That’s what this post tests.

What the Agent Has Access To

The agent isn’t working from a blank prompt. It’s operating inside a system I’ve been building for two months.

Component	What It Provides	Count
`CLAUDE.md`	Operating manual: folder structure, conventions, workflows, rules, no-fabrication policy	1 file, ~400 lines
`VAULT-INDEX.md`	One-line description of every note in the vault	200+ entries
Writing pattern files	Voice and structure patterns from exm7777, Dan Koe, elvissun, TheBeautyOfSaaS, lifemaxxjourney, trq212	11 files
Specialized agents	Code reviewer (opus), researcher (sonnet), SEO auditor (sonnet), content reviewer (opus)	4 agents
Skills	Voice polish, AI pattern killer, SEO audit, copy editing, social content, trend router, Excalidraw diagrams	35+ skills
Knowledge base	Curated articles, research, videos, writing patterns	59 entries
Published posts	Existing articles as voice and structure reference	30+ posts
Maps of Content	Topic-level context with current state and active priorities	4 MOCs

Every component serves a specific purpose in the pipeline. The writing patterns define how sentences should sound. The agents handle specialized tasks the main session shouldn’t carry in its context window. The knowledge base provides sourced reference material. The MOCs keep the agent oriented on what’s current.

The infrastructure is the variable. Same model without this system produces generic content. Same model with this system produces content shaped by specific voice patterns, validated against a scoring rubric, and grounded in real research.

The Production Prompt

This experiment started as a planning session. The actual prompt that kicked off the autonomous pipeline was extracted from that session:

Write a blog post for collinwilkins.com that documents the full content
pipeline being run autonomously by an AI agent. The article IS the process.

Steps:
1. Create content brief in Personal/Blog/Briefs/
2. Run researcher agent for trending topics (March 2026)
3. Write the base article using writing patterns as voice context
4. Write Block A (raw output) on "delegate, review, own" pattern
5. Process Block B: run ai-pattern-killer on Block A
6. Process Block C: run voice-polish on Block A
7. Process Block D: chain ai-pattern-killer → voice-polish on Block A
8. Run content-reviewer agent, include honest score
9. Publish to portfolio repo with proper frontmatter
10. Update VAULT-INDEX.md

Context: 11 writing patterns, 4 specialized agents, 59 KB entries,
30+ published posts. All available in the vault. Use them.

Voice: Collin's practitioner-first style. Direct, specific, no hedging.
Short punchy sentences for key points. Medium for explanation.
Include failure modes and real costs alongside benefits.

That’s it. Everything else was autonomous. The agent read the vault, ran the research, made the topic decision, wrote the sections, processed the treatments, scored itself, and published.

What the Research Found

The researcher agent surfaced five trending topics for March 2026:

Agent Teams shipped February 5. Anthropic released Opus 4.6 with multi-agent orchestration built in. GitHub launched Agent HQ the same week, letting developers choose between Claude Code, Codex, and Copilot agents. One developer used 16 parallel Claude Code agents to build a C compiler from scratch in seven hours.

The 19% slowdown. A METR randomized controlled trial found that experienced developers using AI tools completed tasks 19% slower than without them. Developers predicted they’d be 24% faster. After finishing, they estimated they’d been 20% faster. The measured reality was 19% slower. Small sample (16 devs, 246 tasks, early 2025 tooling) but the perception gap is striking.

MCP at 97 million monthly downloads. The Model Context Protocol hit 97M+ monthly SDK downloads across Python and TypeScript, with 10,000+ active servers. The ecosystem has standardized fast.

Claude Code at 4% of GitHub commits. SemiAnalysis reports approximately 135,000 commits per day, with projections of 20%+ by year-end.

The trust gap is real. The 2025 Stack Overflow Developer Survey found 46% of developers actively distrust AI accuracy. Only 3% report high trust. 66% say their biggest frustration is solutions that are “almost right, but not quite.”

Any of those would make a solid post. But the planning session made the topic decision for me: the process IS the content. No one else can write this exact article because it’s documenting itself in real time. The trending context gets woven in as supporting evidence, not as the headline.

The Experiment: Same Section, Four Treatments

Here’s the test. I wrote one ~300-word section about the “delegate, review, own” model emerging in AI-assisted teams. Then I processed it four ways.

The section covers the same insight. The treatments change how it reads.

Block A: Raw Output (Baseline)

No post-processing. This is what the agent produces with just the writing patterns as context.

A pattern is forming across engineering teams that use AI agents for real work. CIO.com named it directly in January 2026: “delegate, review, own.”

The model is straightforward. AI agents handle first-pass execution: scaffolding, implementation, testing, documentation. Engineers review outputs for correctness, risk, and alignment. Ownership of architecture, trade-offs, and outcomes stays human.

This sounds obvious until you look at the data. The METR study found that experienced developers accepted less than 44% of AI-generated code suggestions. 75% reported reading every line. 56% made major modifications. Developers aren’t blindly shipping AI output. They’re treating agents as first-draft machines and themselves as editors with veto power.

Addy Osmani at Google describes the same workflow: “The LLM is an assistant, not an autonomously reliable coder. I am the senior dev; the LLM is there to accelerate me, not replace my judgment.”

The Anthropic 2026 Agentic Coding Report puts numbers on the gap: developers use AI in roughly 60% of their work but can fully delegate only 0-20% of tasks. The delta between usage and delegation is the review layer. It’s where quality control lives.

This maps directly to content production. The agent writes the first draft. The writing patterns shape the voice. The content reviewer scores against a rubric. But someone still owns the decision to publish. In this experiment, nobody does. That’s the point. I’m testing what happens when you remove the human review layer entirely and rely on infrastructure instead.

The bet: if the writing patterns, agents, and scoring rubric are good enough, the infrastructure can substitute for the human editor. If they’re not, you’ll see exactly where it breaks.

Subscribe to the newsletter for practical AI engineering breakdowns.

Block B: AI Pattern Killer

Block A processed through the ai-pattern-killer skill at paranoid sensitivity (9.5 minimum score, burstiness target 10, zero hedge tolerance). Tests whether removing AI-sounding patterns improves readability.

Every engineering team running AI agents for real work lands in the same place. CIO.com named it in January 2026: “delegate, review, own.”

The model is simple. AI agents handle the first pass — they scaffold, implement, test, and write the docs. Engineers read every output, checking for bugs, security gaps, and whether it matches what was actually asked for. Architecture, trade-offs, and outcomes? Those stay with the humans.

This sounds obvious until you look at the data. Look at METR’s numbers. Experienced developers accepted less than 44% of AI-generated code. 75% reported reading every line. 56% made major modifications. Developers aren’t blindly shipping AI output. They’re treating agents as first-draft machines. The human gets the final call.

Addy Osmani at Google: “The LLM is an assistant, not an autonomously reliable coder. I am the senior dev; the LLM is there to accelerate me, not replace my judgment.”

The Anthropic 2026 Agentic Coding Report puts numbers on the gap: developers use AI in 60% of their work but can fully delegate only 0-20% of tasks. That 40-60% delta between usage and full delegation is the review layer — where quality control either holds or doesn’t.

Now apply this to content. Agent writes the first draft, writing patterns control the voice, and a scoring rubric flags what slips through. But someone still decides whether to hit publish. In this experiment, nobody does.

That’s the point.

I’m testing what happens when you pull the human out of the loop and replace the review layer with infrastructure — writing patterns and agents, backed by a 140-point scoring rubric. If the system works, you don’t need the editor. If it doesn’t, you’ll see exactly where it breaks.

Annotation: At paranoid sensitivity, the AI pattern killer found 17 flags — 12 REQUIRED, 5 REVIEW. Key fixes: de-nominalized “first-pass execution” to active verbs, broke anaphora chains (three consecutive “The…” paragraph openers), removed filler adverbs (“directly,” “roughly”), added bold emphasis on key terms, shortened sentences via person shifts. Final score: 9.6/10 across 28 sentences. The paranoid config caught structural patterns the default medium config missed — sentence cadence uniformity, nominalization density, identical paragraph openings. The writing patterns prevented vocabulary-level AI tells upstream, but structural tells survived until the sensitivity was cranked up. This is the most interesting result: the config sensitivity level is the variable that determines whether post-processing earns its cost.

Hit reply and tell me which version reads best.

Block C: Voice Polish

Block A processed through the voice-polish skill. Tests whether applying Collin’s specific voice patterns changes the tone.

Engineering teams figured out the pattern before anyone named it. CIO.com put words on it in January 2026: “delegate, review, own.”

Three words. That’s the whole model.

The agent writes the first pass — scaffolding, implementation, tests, docs. The engineer reads every line, catches what’s wrong, and decides what ships. Architecture stays human. Trade-offs stay human. The agent is a draft machine with no ego and no judgment.

Sounds like common sense. The data says otherwise.

The METR study found developers accepted less than 44% of AI-generated code. More than half gets thrown out or rewritten. 75% read every line. 56% made major modifications. Nobody’s blindly shipping agent output. They’re treating AI like a junior dev with a 56% rewrite rate.

Addy Osmani at Google nails it: “The LLM is an assistant, not an autonomously reliable coder. I am the senior dev; the LLM is there to accelerate me, not replace my judgment.”

The Anthropic 2026 Agentic Coding Report quantifies the gap: 60% of work involves AI, but only 0-20% gets fully delegated. That 40-60% delta is the review layer. That’s where quality lives or dies.

Now map this to content production. Same pattern. The agent writes the draft. Writing patterns shape the voice. A scoring rubric catches what slips through. But someone still decides whether to hit publish.

In this experiment, nobody does.

That’s the whole point. I pulled the human out of the loop and replaced the review layer with infrastructure. Writing patterns, specialized agents, a 140-point scoring rubric. If the system is good enough, the infrastructure substitutes for the editor. If it’s not, you’ll see exactly where it breaks in the next 500 words.

Annotation: Voice polish restructured paragraphs for rhythm variation, added short punchy sentences (“Three words. That’s the whole model.”), increased directness (“Nobody’s blindly shipping agent output”), and added momentum transitions pulling toward the next section (“you’ll see exactly where it breaks in the next 500 words”). The data and sources are unchanged. Sentence length standard deviation increased from ~6 to ~11, hitting the burstiness target. The biggest shift: Block A explains what the pattern is. Block C makes you feel the stakes of testing it.

Open your agent’s config. Add one writing pattern file. Run one session with it. You’ll hear the difference immediately.

Block D: Pattern Killer → Voice Polish (Chained)

Block A processed through ai-pattern-killer, then through voice-polish. Tests whether the combined treatment produces the best result or over-processes.

Engineering teams figured out the pattern before anyone named it. CIO.com put words on it in January 2026: “delegate, review, own.”

Three words. That’s the whole model.

Sounds like common sense. The data says otherwise.

Addy Osmani at Google nails it: “The LLM is an assistant, not an autonomously reliable coder. I am the senior dev; the LLM is there to accelerate me, not replace my judgment.”

The Anthropic 2026 Agentic Coding Report quantifies the gap: 60% of work involves AI, but only 0-20% gets fully delegated. That 40-60% delta is the review layer. That’s where quality lives or dies.

In this experiment, nobody does.

Annotation: Block D shows voice polish applied to the original Block A (before the AI pattern killer config was strengthened). At the original medium sensitivity, the pattern killer made zero changes, so Block D matched Block C. After cranking to paranoid sensitivity, the pattern killer produced the distinct Block B above — meaning a true chained treatment would now start from Block B’s rewritten base and produce a third distinct output. The takeaway: stacking post-processing skills isn’t inherently overhead or inherently valuable. It depends entirely on the sensitivity config. At medium, the first link in the chain was a no-op. At paranoid, it rewrites 12 sentences before voice polish even touches the text.

Tools used in this experiment: Claude Code CLI, Obsidian, 11 custom writing pattern files, 4 specialized agents. The agent definitions and skill configs are all markdown files. No code required.

What I’d Score This

The content-reviewer agent scored this draft 7/10. Here’s what it flagged.

Strongest areas: Pillar alignment (9/10, direct hit on “AI Agents in Practice”), structure (8/10), and the unique concept. The reviewer noted the production prompt as “the strongest transparency move” and flagged the sensitivity-as-variable finding in Block B as the most actionable insight.

Weakest area: Sourcing (6/10). Three claims lack URLs: the Addy Osmani quote, the CIO.com “delegate, review, own” attribution, and the Anthropic 2026 Agentic Coding Report statistics. The reviewer also flagged a potential conflation between METR study statistics and Anthropic report statistics in Block A. These are real gaps. The no-fabrication rule requires verifiable links or hedge language, and the agent didn’t provide links for every claim.

SEO: The exact target keyword phrase “ai agent content pipeline” never appears verbatim in the body text or any heading. The individual words appear throughout, but exact-match targeting is missing.

AI pattern detection: 8 minor patterns found, mostly structural (bold-label repetition in the research section, “the point is” framing used twice). Zero vocabulary-level AI tells. The reviewer confirmed the writing patterns prevented the vocabulary issues upstream.

The honest assessment: A 7 is fair. The concept carries the piece, but the sourcing gaps are real. The reviewer’s top fix: “Add missing source URLs.” The agent didn’t do that. That’s a human-review task the infrastructure couldn’t handle autonomously.

This is exactly the kind of finding that makes the experiment worth running. The infrastructure handles mechanics and voice. It doesn’t handle editorial judgment about what needs a citation link.

What Breaks, What Holds

The infrastructure handled the mechanical parts without issue. Research agent found sourced data. Writing patterns shaped sentence rhythm and structure. The brief-to-draft pipeline produced something that reads like a real post, not a ChatGPT summary.

What the infrastructure can’t do: make judgment calls about what’s interesting. The topic selection worked here because the planning session pre-decided it. A fully autonomous topic selection from a trending list would need taste, and taste is the hardest thing to encode in a system prompt.

The four-treatment experiment reveals something useful regardless of which block reads best. If the raw output is good enough, the post-processing skills are overhead. If the chained treatment is noticeably better, the skills are earning their context cost. Either way, you learn something about where the quality ceiling actually lives.

The point isn’t that AI can write blog posts. Anyone with a ChatGPT tab can produce 2,000 words in three minutes. The point is that the infrastructure around the agent determines whether those words are worth reading. Writing patterns, context architecture, specialized agents, a scoring rubric with teeth. That system is the product. The model is replaceable. The system isn’t.

If you want to run this experiment yourself, start with the system, not the model. Write down how you actually write. Capture the patterns. Build the context. The agent is only as good as what you give it to work with.

Related: How I Actually Use AI Agents Every Day covers the daily agent workflow this experiment extends. Context Engineering explains the four-layer framework that makes agent infrastructure work. From Vibe Coding to Agentic Engineering covers the paradigm shift this post operates within.

Subscribe to the newsletter for practical AI engineering breakdowns. No summaries of other people’s summaries.

Questions this article answers

Frequently asked questions

What did the content agent have access to?

The agent isn't working from a blank prompt. It's operating inside a system I've been building for two months. Component What It Provides Count --------- CLAUDE.md Operating manual: folder structure, conventions, workflows, rules, no-fabrication policy 1 file, ~400 lines VAULT-INDEX.

How was the four-treatment writing experiment set up?

Here's the test. I wrote one ~300-word section about the "delegate, review, own" model emerging in AI-assisted teams. Then I processed it four ways. The section covers the same insight. The treatments change how it reads. No post-processing. This is what the agent produces with just the writing patterns as context.

What did the experiment get right and wrong?

The content-reviewer agent scored this draft 7/10. Here's what it flagged. Strongest areas: Pillar alignment (9/10, direct hit on "AI Agents in Practice"), structure (8/10), and the unique concept. The reviewer noted the production prompt as "the strongest transparency move" and flagged the sensitivity-as-variable finding in Block B as the most actionable insight.

Collin Wilkins

Lead Engineer · AI & Automation

Lead Software Engineer at Morningstar with 12 years across Ford and enterprise software. I write about what actually works when engineering teams adopt AI — and what breaks when they don't set up the context layer first.

Book an intro call → View services