I designed and built a 7-agent content production system with routing logic, scoring frameworks, feedback loops, and data-driven iteration. It is a personal-production deployment: it runs daily on my own content pipeline and processes real content decisions.
Who: Short-form video creators on Chinese social platforms (Douyin, Xiaohongshu).
What they struggle with: Content creation involves 6–8 sequential decisions — what topic to pursue, whether it's worth making, what structure to use, how to open, what title to write, whether the writing sounds authentic, and what to learn from the results. Each decision requires judgment. Most creators rely on intuition, which doesn't scale and doesn't improve systematically.
The core product question: Can you design an AI agent system where each decision point has explicit criteria, where agents hand off to each other with structured data, and where the system improves its own decision quality over time through a feedback loop?
7 specialised agents, connected through a routing layer. Each agent has its own handoff rules and input/output contract.
7 agents, 3 connection types. Solid arrows = sequential handoff; gold/sage arrows = feedback loop through methodology.md; dashed slate lines = Quality Check called in parallel from any generator. The feedback loop is the reason the filter gets stricter over time — not better prompts.
The claim "the system learns" is only meaningful if the structure of decisions changes over time. Below is the methodology.md file before and after three publish-and-review cycles. The structural shape is the same, but the Banned Set (what the Topic Filter rejects) has grown through Data Review writebacks, not through me editing the prompt.
Before (v1): the Filter approves anything that passes the structural sieves. There is no data to override intuition with.
After three cycles: the Filter rejects 3 entire content categories before generation begins. The list grew only from Data Review writebacks; the prompt in the Topic Filter agent is unchanged from v1.
The prompt didn't get better. The file the prompt reads got more precise. This is the argument for structured state being the leverage point in multi-agent systems — more than prompt engineering, more than model upgrades.
Pure dispatcher. Identifies user intent in one sentence, routes to the correct agent.
Design decision: The routing layer prevents the system from trying to do everything at once. If multiple needs exist, it forces sequential resolution — a deliberate constraint that mirrors how product teams triage requests.
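The dispatch rule can be sketched as a queue of intents. This is a minimal sketch, not the production router: it assumes keyword matching as a stand-in for intent detection, and the agent keys and keyword lists are hypothetical.

```python
from collections import deque

# Hypothetical keyword table. The real router summarises intent in one
# sentence; keyword matching is only a stand-in for that step.
INTENT_KEYWORDS = {
    "topic_filter": ("topic", "idea"),
    "script": ("script", "draft"),
    "opening": ("opening", "hook"),
    "title": ("title", "headline"),
    "quality_check": ("sounds like ai", "authentic"),
    "data_review": ("results", "metrics"),
}


def detect_intents(request: str) -> list:
    """Return every agent whose keywords appear in the request, in table order."""
    text = request.lower()
    return [
        agent
        for agent, keywords in INTENT_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


def route(request: str) -> deque:
    """Queue the matched agents so multiple needs are resolved one at a time."""
    intents = detect_intents(request)
    if not intents:
        raise ValueError("no agent matches this request; ask the user to restate it")
    return deque(intents)
```

Calling `route("Is this topic worth making, and what title fits?")` queues the Topic Filter first and the Title agent second, so the second need waits until the first is resolved.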
Binary go/no-go decision. Core belief: 80% of bad content comes from bad topics, not bad execution.
Why this matters: In any AI product, the most expensive mistake is building the wrong thing. This agent prevents the system from investing effort in content that will fail.
| Filter | What it checks | Pass | Kill |
|---|---|---|---|
| 1. Cognitive Gap | Does this version have a reason to exist? | First mover, clearer framework, or first-hand experience | Same as existing content |
| 2. Material Check | What raw material exists? (data, stories, quotes, failures) | 2+ material types | 0 materials = hard stop |
| 3. Three-Layer Test | Info → Framework replacement → Identity | All 3 layers answered | Pure information with no framework |
| 4. Methodology Validation | Matches proven formula? Hits banned type? | Verified formula with historical data | Banned type = stop with data citation |
Key design decision: Filters run sequentially with user check-ins between each — not batch processed. The agent is a filter, not an advocate. It will never help the user rationalise a passing grade.
Critical dependency: Filter 4 reads from the methodology file, which is updated by the Data Review agent. This is the feedback loop — the gate gets smarter over time.
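The gate can be sketched as four functions run in order, stopping at the first kill. This is an illustrative sketch under stated assumptions: the `Topic` fields, the `banned` dictionary (content type mapped to its data citation), and the `check_in` callback are hypothetical names, and treating exactly one material type as a fail is my reading of the 2+ pass bar. The verified-formula half of Filter 4 is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Topic:
    # Field names are illustrative assumptions, not the system's real schema.
    content_type: str
    differentiator: Optional[str] = None  # first mover, clearer framework, first-hand experience
    materials: list = field(default_factory=list)  # e.g. ["data", "story"]
    layers_answered: int = 0  # of: information, framework replacement, identity


def cognitive_gap(topic: Topic, banned: dict):
    if topic.differentiator:
        return True, f"differentiator: {topic.differentiator}"
    return False, "same as existing content"


def material_check(topic: Topic, banned: dict):
    kinds = len(set(topic.materials))
    if kinds == 0:
        return False, "0 materials, hard stop"
    # Assumption: one material type also fails, since the pass bar is 2+.
    return kinds >= 2, f"{kinds} material type(s)"


def three_layer_test(topic: Topic, banned: dict):
    return topic.layers_answered == 3, f"{topic.layers_answered}/3 layers answered"


def methodology_validation(topic: Topic, banned: dict):
    if topic.content_type in banned:
        return False, f"banned type ({banned[topic.content_type]})"
    return True, "not in the banned set"


FILTERS = [
    ("Cognitive Gap", cognitive_gap),
    ("Material Check", material_check),
    ("Three-Layer Test", three_layer_test),
    ("Methodology Validation", methodology_validation),
]


def run_filters(
    topic: Topic,
    banned: dict,
    check_in: Callable[[str, str], bool] = lambda name, reason: True,
):
    """Run the four filters in order; stop at the first kill or declined check-in."""
    for name, check in FILTERS:
        passed, reason = check(topic, banned)
        if not passed:
            return False, f"{name}: {reason}"
        if not check_in(name, reason):
            return False, f"{name}: stopped at user check-in"
    return True, "all four filters passed"
```

Because `banned` is passed in rather than hard-coded, the same filter code gets stricter as the methodology file grows.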
Generate a complete ~2.5 min script after the topic passes the filter.
Most AI writing tools generate and forget. This agent treats every user edit as a signal. After 10+ revision cycles, the output converges toward the user's voice. This is the difference between a tool and a product.
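One way to turn an edit into a signal is to diff the generated draft against the user's version and count what was removed and what was added. The sketch below uses Python's standard `difflib` and is only an illustration: the `avoid:` / `prefer:` profile keys and the word-level granularity are assumptions, and persisting the profile between sessions is left out.

```python
import difflib
from collections import Counter


def edit_signals(generated: str, edited: str) -> list:
    """Return (removed_phrase, added_phrase) pairs for every changed span."""
    before, after = generated.split(), edited.split()
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    signals = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            signals.append((" ".join(before[i1:i2]), " ".join(after[j1:j2])))
    return signals


def update_profile(profile: Counter, generated: str, edited: str) -> Counter:
    """Count phrases the user keeps deleting and phrases the user keeps adding."""
    for removed, added in edit_signals(generated, edited):
        if removed:
            profile[f"avoid:{removed}"] += 1
        if added:
            profile[f"prefer:{added}"] += 1
    return profile
```

Phrases whose counts keep rising across revision cycles are the ones worth feeding back into the next generation.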
Diagnose content quality first, then generate 10–15 opening options. 90% of bad openings come from bad content, not bad copywriting.
The opening score is the product of three factors, so any factor = 0 means the opening has no force:
| Factor | What it measures | Example |
|---|---|---|
| Prediction disruption | Does the opening break the viewer's default expectation? | During the first few seconds, the viewer can't predict what you're saying next |
| Reward or loss signal | Can the viewer state what they'll get (or miss) within 5 seconds? | "Watch this and you'll get X" / "Scroll past and you'll miss X" |
| Naming | Does the opening label a feeling the viewer has but couldn't articulate? | A new name for a vague feeling — the moment it's named, trust is built |
Why multiplicative, not additive: If prediction disruption is zero (the opening is predictable), it doesn't matter how strong the reward signal is — viewers have already scrolled past. All three must be non-zero.
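The difference between the two scoring rules is easy to see in code. This is a sketch only: the 0 to 5 scale per factor is an assumption, since the scale is not stated here, and `additive_score` exists purely for contrast.

```python
MAX_FACTOR = 5  # assumed scale: each factor is rated 0..5


def _validate(*factors: int) -> None:
    if any(not 0 <= f <= MAX_FACTOR for f in factors):
        raise ValueError(f"each factor must be between 0 and {MAX_FACTOR}")


def hook_score(prediction_disruption: int, reward_or_loss: int, naming: int) -> int:
    """Multiplicative score: a single zero factor zeroes the whole opening."""
    _validate(prediction_disruption, reward_or_loss, naming)
    return prediction_disruption * reward_or_loss * naming


def additive_score(prediction_disruption: int, reward_or_loss: int, naming: int) -> int:
    """Additive score, shown only to contrast with the multiplicative rule."""
    _validate(prediction_disruption, reward_or_loss, naming)
    return prediction_disruption + reward_or_loss + naming
```

A predictable opening with a strong reward signal and strong naming, rated (0, 5, 5), gets 10 under the additive rule but 0 under the multiplicative one, while a modest but complete opening rated (2, 2, 2) gets 8.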
Formula-driven title matching from 75 validated viral formulas. Every title has a formula number and traceability.
| Category | Mechanism |
|---|---|
| Cognitive Conflict (1–6) | Break existing belief |
| Curiosity Gap (7–12) | Information asymmetry |
| Fear / Loss (13–20) | "Not clicking = losing out" |
| Identity Injection (21–25) | "This is about me" |
| Number Anchoring (26–32) | Reduce cognitive load |
| Result Promise (33–40) | Concrete outcome + timeframe |
| + 6 more categories (Controversy, Scene/Condition, Action Call, Authority, Social Proof, Interaction) | |
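Traceability can be sketched as a lookup from formula number to category. The ranges below are copied from the table; formulas 41 to 75 belong to the six remaining categories, whose ranges are not listed, so the sketch returns `None` for them rather than guessing. The function names and the returned dictionary shape are hypothetical.

```python
from typing import Optional

TOTAL_FORMULAS = 75

# Ranges as listed in the table above (inclusive).
NAMED_RANGES = [
    (1, 6, "Cognitive Conflict"),
    (7, 12, "Curiosity Gap"),
    (13, 20, "Fear / Loss"),
    (21, 25, "Identity Injection"),
    (26, 32, "Number Anchoring"),
    (33, 40, "Result Promise"),
]


def category_for(formula_number: int) -> Optional[str]:
    """Map a formula number to its category, or None if the range is not listed."""
    if not 1 <= formula_number <= TOTAL_FORMULAS:
        raise ValueError(f"formula number must be between 1 and {TOTAL_FORMULAS}")
    for low, high, name in NAMED_RANGES:
        if low <= formula_number <= high:
            return name
    return None  # 41..75: one of the six categories without a listed range


def tag_title(title: str, formula_number: int) -> dict:
    """Attach the formula number and category so every title stays traceable."""
    return {
        "title": title,
        "formula": formula_number,
        "category": category_for(formula_number),
    }
```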
Readiness gate: Checks if the production file has all three components (script + opening + title). All present → moves to filming queue. Any missing → blocks and reports.
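The readiness gate reduces to a presence check on three fields. A minimal sketch, assuming the production file is loaded as a dictionary with `script`, `opening`, and `title` keys (the key names are hypothetical):

```python
REQUIRED_COMPONENTS = ("script", "opening", "title")


def readiness(production_file: dict):
    """Return (ready, missing): ready only when all three components are non-empty."""
    missing = [
        name for name in REQUIRED_COMPONENTS if not production_file.get(name)
    ]
    return not missing, missing
```

An empty string counts as missing, so a placeholder field cannot slip a piece into the filming queue.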
Detect AI writing fingerprints. 22 patterns, 3 severity levels. Goal: "find your own voice."
Detection: 22 fingerprints, including exhaustive counter-arguments, uniform parallel rhythms, zero hesitation, and Chinese translation syntax. Each has genre-specific false-positive warnings.
Rewrite mode: Does not rewrite directly. Asks one targeted question per fingerprint: "Which of these parallel phrases is the one you most wanted to say?" The questions probe intent — so the user develops their voice rather than replacing AI patterns with different AI patterns.
This is where the system learns. Record data, run meta-review, extract rules, write back to methodology.
Meta-review runs only for results notably above or below average, and each entry follows the same chain: phenomenon → content type → hypothesised cause → conclusion → next verification direction. The last step is mandatory: rules without a test are worthless.
This closes the loop: the methodology file is what the Topic Filter reads. Every published piece updates the criteria that gate the next piece. The system gets more precise over time.
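The write-back step can be sketched as a function that records a review entry and bans a content type once enough cycles agree. This is a sketch under assumptions, not the actual file format: the methodology is modelled as an in-memory dictionary rather than methodology.md, the `min_cycles` threshold of 3 is hypothetical, and the `"underperformed"` label is an illustrative convention.

```python
from dataclasses import dataclass


@dataclass
class ReviewEntry:
    # Fields follow the review chain described above.
    phenomenon: str
    content_type: str
    hypothesised_cause: str
    conclusion: str  # "underperformed" or "outperformed" in this sketch
    next_verification: str


def write_back(methodology: dict, entry: ReviewEntry, min_cycles: int = 3) -> dict:
    """Record one review entry; ban a content type once enough cycles agree."""
    if not entry.next_verification:
        raise ValueError("a rule without a test is rejected")
    history = methodology.setdefault("history", {}).setdefault(entry.content_type, [])
    history.append(entry)
    misses = sum(1 for past in history if past.conclusion == "underperformed")
    if misses >= min_cycles:
        # The stored string is the data citation the Topic Filter reports.
        methodology.setdefault("banned", {})[entry.content_type] = (
            f"underperformed in {misses} reviewed cycles"
        )
    return methodology
```

The Topic Filter's prompt never changes in this design; only the `banned` mapping it reads does.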
Personal-production deployment — running on my own content pipeline daily.
Structural A/B on content format: 7 content types were tested sequentially against a consistent engagement metric; 3 underperformed across enough cycles that the system removed them from the Topic Filter’s acceptable set. The comparison is structural rather than concurrent — a form of sequential A/B appropriate when content volume is limited.
Methodology file updated after every publish cycle. The system is still running and improving.
| AI Product Skill | Where it shows up |
|---|---|
| Agent architecture design | 7 agents with routing, handoff rules, and input/output contracts |
| Prompt engineering | Each agent has specialised prompt logic (scoring formulas, filters, templates) |
| Evaluation framework design | 3-factor multiplicative hook scoring, 4-filter topic gate, 75-formula title matching |
| A/B testing & experimentation | Structural A/B tests on content format, with controlled variables and metric-based conclusions |
| Feedback loop / iteration | Data Review → methodology write-back → Filter reads updated file |
| Data-driven decision making | 3 content types banned based on metrics, not intuition |
| User research thinking | Three-layer content test (information → framework replacement → identity) |
| Style learning / personalisation | Script agent diffs user edits and updates a persistent style profile |