Task Router Agent — Siyao Zhang

Part 1

The Problem

Context: Running four concurrent tracks (content production, a first client architecture-review brief, portfolio development, coursework) as a solo operator. Task state lived in scattered tools; re-entering context after every session break cost 15–30 minutes of warm-up before any real work.

Product constraint: I wanted a task system where I could add work at any time, and tasks assigned to automation would process themselves without requiring me to be at the machine. Close the laptop, come back hours later, find work done.

Why this matters for agent design: Most "AI assistants" stop when the chat window closes. They’re attached to a session, not to the work. Building an agent runtime that’s independent of session lifecycle is a fundamentally different design problem — and the core shift required for agents to act as infrastructure rather than as tools.

The design question: Can a single plain-text file, paired with a state-machine protocol and an LLM decision layer, serve as a persistent agent runtime? If yes, the same pattern generalizes to any operational workflow — support triage, content pipelines, ops automation.

Part 2

System Architecture

Task Routing Flow

Trigger · task-board.md edited (any source: me, another agent, a cron)

↓ fswatch · 2s debounce · launchd hourly fallback

Quiescence Gate · 30-second idle required before processing (prevents mid-edit race conditions)

↓ Acquire file lock · mkdir atomic primitive

LLM Decision Layer · Parse task-board state machine · classify each task: ready / blocked:* / running:* / done / failed:*

↓

ready → parallel execution (up to 3 concurrent subagents)

blocked / done → skip without re-evaluation

running > 30min → reset to ready (deadlock recovery)

↓ Single-writer protocol · main agent consolidates state updates

State Update + Append-Only Log · task-board.md writeback · events.jsonl structured log · deadletter/ for terminal failures
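
The lock step in the flow above relies on directory creation being atomic: only one process can create a given directory, so whichever run succeeds holds the lock. Below is a minimal Python sketch of that primitive. The runtime itself is shell-based, and the name `board_lock` and the lock path are hypothetical, chosen only for illustration. The sketch does not handle a lock left behind by a crashed run.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def board_lock(lock_dir):
    """Hold an exclusive lock by creating a directory; os.mkdir fails if it exists."""
    try:
        os.mkdir(lock_dir)  # atomic: exactly one caller can create the directory
    except FileExistsError:
        raise RuntimeError(f"lock already held: {lock_dir}")
    try:
        yield lock_dir
    finally:
        os.rmdir(lock_dir)  # release on normal exit and on exceptions


if __name__ == "__main__":
    # Hypothetical lock location inside a throwaway directory.
    demo_lock = os.path.join(tempfile.mkdtemp(), "task-board.lock")
    with board_lock(demo_lock):
        print("lock held, safe to process the board")
    print("lock released:", not os.path.exists(demo_lock))
```

A second run that finds the directory already present fails immediately instead of waiting, which suits a watcher that will fire again on the next change anyway.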

Layer Breakdown

L5 · Control Plane · task-board.md · Markdown as UI & source of truth

L4 · Scheduler · fswatch + launchd · Event-driven + time-based fallback

L3 · Executor · bash + claude -p · Stateless invocation, one-shot

L2 · Decision Layer · task-check skill (LLM) · State-machine routing & handoff

L1 · State & Audit · events.jsonl + deadletter/ · Append-only log & failure capture
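
To make the L1 layer concrete, here is a small Python sketch of an append-only JSON Lines log. Only the file name events.jsonl and the event name file_changed_triggered come from this write-up; the field names (`ts`, `event`, `source`, `task`, `new_state`), the `task_state_changed` event, and both function names are assumptions made for illustration.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def append_event(log_path, event, **fields):
    """Append one JSON object per line; existing lines are never rewritten."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    with open(log_path, "a", encoding="utf-8") as log:  # "a" = append-only
        log.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def read_events(log_path):
    """Read the log back as a list of dicts, one per non-empty line."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log if line.strip()]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "events.jsonl")
    append_event(path, "file_changed_triggered", source="fswatch")
    # "task_state_changed" is a hypothetical event name.
    append_event(path, "task_state_changed", task="TASK-015", new_state="done")
    for entry in read_events(path):
        print(entry["ts"], entry["event"])
```

Because every line is an independent JSON object, the log can be tailed, grepped, or counted without parsing the whole file.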

Part 3

Key Design Decisions

Five deliberate trade-offs, each with a specific alternative I rejected.

1. Markdown as the Control Plane, not Notion / Airtable / SQL

Rejected: Structured databases (Notion, Airtable, Postgres). Schema-enforced, queryable, multi-user ready.

Chosen: A single Markdown file with YAML-ish front-matter per task.

Reasoning: Latency-to-edit < 1 second, zero vendor dependency, git-versioned, human-readable without tooling, LLM-parseable without schema inference. The cost (no schema enforcement) is recoverable by the state-machine layer (see decision 2).

When to revisit: If multi-operator concurrent editing becomes a hard requirement, or task volume exceeds ~500 entries, where Markdown parsing becomes a bottleneck.

2. Typed State Machine, not Natural-Language Status

Rejected: Free-form status strings ("waiting for client", "in progress", "blocked on API").

Chosen: Typed state tokens: ready / blocked:siyao:send-brief / running:agent-a4b:2026-04-20T15:25Z / done / failed:exhausted_retries

Reasoning: A typed state is a protocol, not a label. The LLM parsing it can reason about dependency chains, deadlock conditions, and timestamp validity, not just pattern-match strings. Biggest single impact on system intelligence (see Part 5).

Cost: One-time migration of all existing tasks (13 cards) + discipline to maintain taxonomy. Small upfront, compounding downstream.
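
As an illustration of what "typed" buys, the sketch below parses the state tokens listed above into structured fields and maps them to the actions in the routing flow. It is a Python sketch under assumptions, not the task-check skill itself: the `Status` fields, the function names, and the choice to treat failed:* as "skip" are mine.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

STALE_AFTER = timedelta(minutes=30)  # threshold from the routing flow


@dataclass(frozen=True)
class Status:
    kind: str                              # ready | blocked | running | done | failed
    detail: str = ""                       # e.g. "siyao:send-brief" or "agent-a4b"
    started_at: Optional[datetime] = None  # set only for running:* tokens


def parse_status(token: str) -> Status:
    """Split a typed state token into its kind and structured sub-fields."""
    kind, _, rest = token.strip().partition(":")
    if kind in ("ready", "done") and not rest:
        return Status(kind)
    if kind in ("blocked", "failed") and rest:
        return Status(kind, rest)
    if kind == "running" and ":" in rest:
        owner, _, stamp = rest.partition(":")  # the timestamp itself contains ":"
        started = datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
        return Status(kind, owner, started)
    raise ValueError(f"not a valid state token: {token!r}")


def route(status: Status, now: datetime) -> str:
    """Map a parsed status to an action: execute, reset_to_ready, or skip."""
    if status.kind == "ready":
        return "execute"
    if status.kind == "running" and now - status.started_at > STALE_AFTER:
        return "reset_to_ready"
    return "skip"  # blocked, done, fresh running; failed is an assumption here


if __name__ == "__main__":
    now = datetime(2026, 4, 20, 16, 0, tzinfo=timezone.utc)
    for token in ["ready", "blocked:siyao:send-brief",
                  "running:agent-a4b:2026-04-20T15:25Z", "done",
                  "failed:exhausted_retries"]:
        print(f"{token} -> {route(parse_status(token), now)}")
```

A free-form string such as "in progress" is rejected by `parse_status` with a `ValueError`, which is the practical difference between a label and a protocol.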

3. Single-Writer Concurrency, not Distributed Locks

Rejected: Multi-agent concurrent writes to task-board.md, coordinated by locking or merge schemes (flock, file-append patterns, CRDTs).

Chosen: Strict single-writer protocol: only the main agent writes state. Subagents execute tasks in parallel and return structured results; the main agent serializes state updates.

Reasoning: Markdown files are not transactional. Concurrent writes cause lost updates, structural corruption, or unparseable states, and those failures cascade across the system. Single-writer limits write parallelism but eliminates an entire class of race-condition bugs at zero runtime cost.

Trade-off: Execution parallelism is preserved (up to 3 concurrent subagents); state serialization becomes the coordination bottleneck only if task throughput exceeds ~1/sec sustained. Not a real constraint for this workload.
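
The single-writer idea can be sketched in a few lines of Python: workers run in parallel and only return results, while the calling thread is the only code that touches state. This is a sketch under assumptions. A dict stands in for task-board.md, `run_subagent` is a hypothetical placeholder for a real subagent call, the task IDs are made up, and the `failed:<ExceptionName>` token is an illustrative choice.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_SUBAGENTS = 3  # concurrency cap from the routing flow


def run_subagent(task_id):
    """Hypothetical stand-in for one subagent call; returns a result, never writes state."""
    return {"task": task_id, "new_state": "done"}


def process_ready(board, ready_ids, worker=run_subagent):
    """Run workers in parallel, then apply every state change from the calling thread only."""
    with ThreadPoolExecutor(max_workers=MAX_SUBAGENTS) as pool:
        futures = {pool.submit(worker, task_id): task_id for task_id in ready_ids}
        for future in as_completed(futures):
            task_id = futures[future]
            try:
                result = future.result()
                board[task_id] = result["new_state"]  # single writer: main thread
            except Exception as exc:
                board[task_id] = f"failed:{type(exc).__name__}"
    return board


if __name__ == "__main__":
    board = {"TASK-001": "ready", "TASK-002": "ready",
             "TASK-003": "blocked:siyao:send-brief"}
    ready = [task_id for task_id, state in board.items() if state == "ready"]
    print(process_ready(board, ready))
```

Because workers never see the board, there is nothing for them to corrupt; the cost is that all updates pass through one loop.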

4. Quiescence Gate, not Simple Debounce

Rejected: Fixed 2-second debounce on file watcher events.

Chosen: Watcher debounce (2s) plus a quiescence gate: require 30 seconds of file idle before the executor proceeds. Polls every 5s; resets if any change detected.

Reasoning: A user mid-edit can trigger the watcher 5–10 times during one logical "save." Without quiescence, the executor races the user; state updates collide with in-progress edits. The 30s gate aligns automated processing with the user’s natural pause rhythm.

Cost: 30-second latency floor on processing. Acceptable for asynchronous task flow; not appropriate for real-time chat agents.
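
The gate logic described above (poll every 5s, require 30s of idle, reset on any change) might look like the following in Python. This is a sketch, not the shell implementation: the function name is hypothetical, and using the file's modification time as the change signal is an assumption. The `mtime`, `clock`, and `sleep` parameters exist only so the loop can be exercised without real waiting.

```python
import os
import tempfile
import time

IDLE_SECONDS = 30  # required quiet period
POLL_SECONDS = 5   # polling interval


def wait_for_quiescence(path, idle=IDLE_SECONDS, poll=POLL_SECONDS,
                        mtime=os.path.getmtime, clock=time.monotonic,
                        sleep=time.sleep):
    """Block until `path` has gone `idle` seconds without its mtime changing."""
    last_seen = mtime(path)
    quiet_since = clock()
    while clock() - quiet_since < idle:
        sleep(poll)
        current = mtime(path)
        if current != last_seen:  # an edit landed: restart the idle window
            last_seen = current
            quiet_since = clock()
    return clock() - quiet_since


if __name__ == "__main__":
    # Short timings so the demo finishes quickly; the real gate uses 30s / 5s.
    with tempfile.NamedTemporaryFile(suffix=".md") as board:
        waited = wait_for_quiescence(board.name, idle=0.2, poll=0.05)
        print(f"file was idle for {waited:.2f}s, safe to proceed")
```

Unlike a fixed debounce, the idle window restarts on every detected change, so a burst of saves delays processing until the burst is over.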

5. No RAG, No Fine-Tuning — Context Fits in a Prompt

Rejected: Retrieval-augmented generation over task history, or fine-tuning a model on past task decisions.

Chosen: Load the full task-board.md (typically 1–3K tokens) directly into each decision call. No vector DB, no embeddings, no training.

Reasoning: RAG earns its cost when the knowledge base exceeds the context window, or when retrieval precision itself is the product. Neither applies here: the full state is short, cheap to include, and the decision layer needs all active tasks to reason about dependency chains, so partial retrieval would break that. Fine-tuning is the wrong tool for routing logic that changes weekly.

When to revisit: If task history exceeds ~50K tokens sustained, or if cross-workflow queries need semantic search. Until then, the bottleneck isn’t retrieval; it’s the state-machine contract.

Part 4

Measurable Results

Baseline tracking started 2026-04-20. Numbers reflect the first runtime week; some are estimated from initial runs with methodology noted.

File-change detection: ~2s. Measured from events.jsonl as the time from save to the file_changed_triggered event.

Pre-execution gate: 30s. Deliberate idle requirement to prevent partial-edit execution.

End-to-end task resolution: ~30s. For deterministic tasks (state routing, status checks). Subagent tasks vary by workload.

Deadlock recovery: <30min. Self-healing: stuck running:* states auto-reset. Zero manual intervention required.

Monthly runtime cost: ~$10–30. LLM API only. No separate hosting cost (runs on laptop via launchd).

Critical-action approval rate: 100%. All destructive / irreversible actions are gated behind human-in-the-loop approval by design.

Efficiency Gains (Conservative Estimate)

Before the system: manual task-list checks and routine research tasks consumed an estimated 8–12 hours per week. After the system: routine tasks (competitive research, documentation drafting, market scans) are routed to parallel subagents without requiring real-time supervision.

Observed first-week output: 5 substantive research / writing tasks completed without active supervision (representative examples: 260-line competitor analysis, 895-line architectural case study, 18-JD market scan, HR-perspective portfolio review).

Methodology: hours-saved estimates come from a self-reported baseline recorded before the system existed. Accurate quantification begins in week 2 (ongoing).

Part 5

Emergent Behavior (The Core Insight)

I programmed the decision layer to do one thing: find tasks marked ready and execute them. Skip everything else.

On the first production run, it did something I didn’t program it to do:

Caught a semantic inconsistency I had written: a task was marked blocked:task:TASK-015, but TASK-015 had status done. The skill flagged the dependency chain was broken, suggested promoting the task to ready, and asked whether that matched my intent.
Identified scope ambiguity in a task I had marked ready: it compared the task’s Measure field against available inputs, noticed a required artifact (portfolio URL) was missing, and refused to execute — deferring to me for clarification.
Detected a timezone bug in a status timestamp I’d written manually: running:claude-a:2026-04-20T15:25Z was local time mislabeled as UTC. The skill computed that this timestamp was 2 hours in the future relative to now, flagged it as an anomaly, and asked for clarification.

None of these behaviors were in the prompt. They emerged from the state-machine design.

Same LLM. Same task-board. Different agent behavior, depending on whether status is a label or a protocol. A natural-language status field ("waiting", "in progress") gives the model a token to match. A typed state with structured sub-fields gives the model a protocol to reason about. The takeaway I’d package for a product team: structured state is what turns agents from execution bots into reasoning partners — not better prompts.

This has implications for how agent systems should be designed. Most teams iterate on prompts when agent behavior is inadequate. The higher-leverage move is often to change the type system that the agent operates within.
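
To show why typed tokens make two of the anomalies above checkable at all, here is a small Python sketch that finds a stale dependency and a future-dated timestamp in a board. It does not reproduce what the LLM did (the scope-ambiguity case needs judgment, not string handling), and it is not part of the runtime. The function name, the dict standing in for the board, and the IDs TASK-016 to TASK-018 are hypothetical. Checks of this kind would fit a regression corpus.

```python
from datetime import datetime, timezone


def find_anomalies(board, now):
    """Return readable findings for states that contradict each other or the clock."""
    findings = []
    for task_id, status in board.items():
        parts = status.split(":", 2)
        if parts[:2] == ["blocked", "task"] and len(parts) == 3:
            if board.get(parts[2]) == "done":
                findings.append(f"{task_id}: blocked on {parts[2]}, which is already done")
        elif parts[0] == "running" and len(parts) == 3:
            started = datetime.strptime(parts[2], "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
            if started > now:
                findings.append(f"{task_id}: start time {parts[2]} is in the future")
    return findings


if __name__ == "__main__":
    board = {
        "TASK-015": "done",
        "TASK-016": "blocked:task:TASK-015",               # dependency already done
        "TASK-017": "running:claude-a:2026-04-20T15:25Z",  # stamped two hours ahead
        "TASK-018": "ready",
    }
    now = datetime(2026, 4, 20, 13, 25, tzinfo=timezone.utc)
    for finding in find_anomalies(board, now):
        print(finding)
```

Neither check is possible against a status like "waiting" or "in progress", because there is no referenced task ID or timestamp to compare.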

Part 6

Generalization & Next Steps

Pattern Transfer

The architecture is deliberately not task-automation-specific. The same five-layer pattern (control-plane file + event-driven scheduler + stateless executor + typed state machine + append-only audit log) maps cleanly to several workflow categories listed in a current client brief (Shopify operator) I’m scoping for architecture review:

Workflow	Control Plane	Decision Layer Task
Support ticket triage	Helpdesk ticket queue	Classify + draft reply + flag for review
Pre-order management	Supplier ETA tracker	Detect delta + generate customer notices
Product creation from supplier data	Inbound spreadsheet queue	Extract fields + normalize + push to draft
Collection-buying intake	Inbound inquiry log	Extract + score + draft follow-up

The same state-machine discipline (ready / blocked:* / running:* / done / failed:*) applies to all of them. The only workflow-specific component is the prompt template in the decision layer. Infrastructure is shared.

Planned Evolution

Observability surface: a local HTML dashboard, kept running by a launchd job, showing task throughput, error rates, and latency distribution over time.
Prompt version control: decision-layer prompts moved to a separate git repository with semantic versioning & regression tests against a captured input corpus.
MCP-compatible handoff protocol: standardized result schema aligned with the Model Context Protocol, so subagents can chain without main-agent mediation when tasks are independent — and so external MCP-speaking agents could plug into this runtime without protocol translation.
State persistence upgrade: SQLite alongside Markdown if cross-workflow querying becomes a recurring need (Markdown remains the human UI).

What I’d Do Differently

Instrument metrics from day one, not day seven. Even rough counters in events.jsonl are worth more than precise metrics added later — retroactive measurement forces estimation.
Document the state machine before coding it. I migrated from natural-language status to typed states after running into parsing ambiguities. The right order was: state machine on paper → skill design → code. I did it in reverse.

Building this system taught me one thing I can take to any agent product: the state machine is the product. Prompts, subagents, scheduling layers — all replaceable. The state contract between human intent and agent action is what persists, scales, and determines whether the system can reason or only execute.

Part 7

Build vs. Buy — Why I Didn’t Use Coze, Dify, or n8n

Scope honesty first: this is a single-operator runtime I built for my own task flow, not a platform product. It solves the specific shape of one person orchestrating a handful of workflows. The question I had to answer before writing any code was not "build vs. launch a competitor to Coze" — it was "does an existing tool already solve this shape, and if so, which?" The analysis below is what I did before deciding to build.

What I was optimising for

Operator = me, single writer. No multi-user editing. No customer-facing UI. No handoff to a non-technical PM.
State is already Markdown. Everything I track — tasks, journals, project notes — lives in plain text files inside a git repo. Any tool that wants me to re-enter that state into a canvas is a non-starter.
Edits at the speed of a keystroke. I wanted to be able to add a task from any editor, any terminal, and have the runtime notice within seconds. No form submission, no save-and-deploy cycle.
Typed state the LLM can reason about. Not a colour tag or free-text status, but a protocol with failure modes the model can inspect (blocked:task:TASK-010 means TASK-010 isn’t done yet; the model can check).

What each option would have given me — and why I passed

Coze (ByteDance) — visual agent builder, cloud SaaS

Strengths: Fastest path to a shipped agent for a non-technical team. Rich pre-built tool integrations inside the ByteDance ecosystem.

Why I passed: My state is already Markdown + git. Using Coze means re-creating that state in a canvas UI and accepting vendor lock-in. For a solo operator who lives in the terminal, the canvas is friction, not help.

Dify (open source) — LLM app platform, self-hosted

Strengths: RAG, agent, and dataset management in one place. Good choice for a team building a product that bundles several AI capabilities.

Why I passed: Dify is aimed at shipping a user-facing product; it pays its complexity tax in features I don’t use (multi-tenant datasets, conversation management, embedded UI). For a personal task runtime, Docker + Postgres + Redis + a web UI is overkill.

n8n (open source) — node-graph workflow automation

Strengths: 400+ pre-built integrations. Best choice if the AI step is one node inside a larger multi-SaaS pipeline.

Why I passed: The AI decision isn’t one node in my workflow; it is the routing layer. n8n treats LLMs as yet another integration; my design treats the typed state as the first-class input the LLM reasons about. Different centre of gravity.

The common decision

All three platforms abstract the workflow behind a UI and a schema you configure through that UI. That’s the right call when the operator is non-technical, when state needs to be shared across a team, or when the workflow is the product. None of those apply here. For a solo operator whose workflow state is already text, the cost of introducing a second state system (canvas + vendor storage) outweighs the convenience of pre-built nodes.

Put differently: the reason this runtime is ~300 lines of shell + a skill definition, rather than a deploy of Dify, is not that I think I’d build it better. It’s that my shape (one operator, Markdown-native, terminal-native, zero-vendor) is the shape that these platforms explicitly don’t optimise for. For their actual audience — cross-functional teams shipping user-facing agents — they’re the right choice, and I’d recommend them.

The transferable skill isn’t "I built a better platform." It’s: before writing a line of code, I named the four things I actually needed, mapped each existing tool against them, and picked the build option only when every tool failed a specific criterion. That’s the same build-vs.-buy frame I’d apply to any agent platform decision in a product role — including recommending against a custom build when an existing tool covers the scope.

Siyao Zhang · Available from September 2026, Beijing · Contact

Task Router Agent: A State-Machine Orchestration Layer on a Markdown File