Agent Orchestration · System Design

Task Router Agent
A State-Machine Orchestration Layer on a Markdown File

I designed an LLM-powered task orchestration runtime that routes, executes, and self-heals autonomously. A single Markdown file serves as the agent control plane. The runtime is session-independent: edit the file, and the work happens in the background whether or not a session is open. Below is how the architecture was designed, the trade-offs I made, the measurable results, and the emergent behavior I didn’t program for.

Part 1

The Problem

Context: Running four concurrent tracks (content production, a first client architecture-review brief, portfolio development, coursework) as a solo operator. Task state lived in scattered tools; re-entering context after every session break cost 15–30 minutes of warm-up before any real work.

Product constraint: I wanted a task system where I could add work at any time, and tasks assigned to automation would process themselves without requiring me to be at the machine. Close the laptop, come back hours later, find work done.

Why this matters for agent design: Most "AI assistants" stop when the chat window closes. They’re attached to a session, not to the work. Building an agent runtime that’s independent of session lifecycle is a fundamentally different design problem — and the core shift required for agents to act as infrastructure rather than as tools.

The design question: Can a single plain-text file, paired with a state-machine protocol and an LLM decision layer, serve as a persistent agent runtime? If yes, the same pattern generalizes to any operational workflow — support triage, content pipelines, ops automation.


Part 2

System Architecture

Task Routing Flow

Trigger · task-board.md edited (any source: me, another agent, a cron)
↓  fswatch · 2s debounce · launchd hourly fallback
Quiescence Gate · 30-second idle required before processing (prevents mid-edit race conditions)
↓  Acquire file lock · mkdir atomic primitive
LLM Decision Layer · Parse task-board state machine · classify each task: ready / blocked:* / running:* / done / failed:*
ready → parallel execution (up to 3 concurrent subagents)
running > 30min → reset to ready (dead-lock recovery)
↓  Single-writer protocol · main agent consolidates state updates
State Update + Append-Only Log · task-board.md writeback · events.jsonl structured log · deadletter/ for terminal failures
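
The mkdir lock step in the flow above works because creating a directory either succeeds or fails as one atomic operation. A minimal Python sketch of that idea follows. The runtime itself is described as shell, so this is an illustration, not its code; the names board_lock, try_run, and LockHeld, and the lock directory path, are hypothetical.

```python
import os
from contextlib import contextmanager


class LockHeld(Exception):
    """Raised when another process already holds the lock directory."""


@contextmanager
def board_lock(lock_dir):
    """Hold an exclusive lock for the duration of the with-block.

    os.mkdir fails with FileExistsError if the directory already exists,
    so exactly one caller can create it at a time.
    """
    try:
        os.mkdir(lock_dir)
    except FileExistsError:
        raise LockHeld(lock_dir) from None
    try:
        yield
    finally:
        os.rmdir(lock_dir)  # release even if the body raises


def try_run(lock_dir, work):
    """Run work() under the lock; return False (and skip) if the lock is busy."""
    try:
        with board_lock(lock_dir):
            work()
    except LockHeld:
        return False
    return True
```

A skipped run is safe under this design because the hourly launchd fallback named in the flow picks the board up again later. A crashed holder would leave the directory behind, so a real version would likely need a stale-lock check as well.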

Layer Breakdown

L5 | Control Plane | task-board.md | Markdown as UI & source of truth
L4 | Scheduler | fswatch + launchd | Event-driven + time-based fallback
L3 | Executor | bash + claude -p | Stateless invocation · one-shot
L2 | Decision Layer | task-check skill (LLM) | State-machine routing & handoff
L1 | State & Audit | events.jsonl + deadletter/ | Append-only log & failure capture
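
The L1 audit layer can be pictured as one JSON object per line, opened in append mode so earlier records are never rewritten. A small sketch under assumptions: the field names ts and event and the helper names are illustrative, since the source only names the file (events.jsonl) and one event type (file_changed_triggered).

```python
import json
import os
from datetime import datetime, timezone


def append_event(log_path, event, **fields):
    """Append one structured record to the JSONL log and return it."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "event": event,
        **fields,
    }
    # Mode "a" only ever adds to the end of the file.
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_events(log_path):
    """Load every record back, skipping blank lines."""
    with open(log_path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Because each line is independent JSON, a measurement such as save-to-trigger latency can be computed later by reading the log, without any database.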

Part 3

Key Design Decisions

Five deliberate trade-offs, each with a specific alternative I rejected.

1. Markdown as the Control Plane, not Notion / Airtable / SQL
Rejected: Structured databases (Notion, Airtable, Postgres). Schema-enforced, queryable, multi-user ready.
Chosen: A single Markdown file with YAML-ish front-matter per task.
Reasoning: Latency-to-edit < 1 second, zero vendor dependency, git-versioned, human-readable without tooling, LLM-parseable without schema inference. The cost (no schema enforcement) is recoverable by the state-machine layer (see decision 2).
When to revisit: If multi-operator concurrent editing becomes a hard requirement, or task volume exceeds ~500 entries, the point where Markdown parsing becomes a bottleneck.
2. Typed State Machine, not Natural-Language Status
Rejected: Free-form status strings ("waiting for client", "in progress", "blocked on API").
Chosen: Typed state tokens: ready / blocked:siyao:send-brief / running:agent-a4b:2026-04-20T15:25Z / done / failed:exhausted_retries
Reasoning: A typed state is a protocol, not a label. The LLM parsing it can reason about dependency chains, deadlock conditions, and timestamp validity, rather than just pattern-matching strings. This had the biggest single impact on system intelligence (see Part 5).
Cost: One-time migration of all existing tasks (13 cards) + discipline to maintain the taxonomy. Small upfront, compounding downstream.
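
To make the "protocol, not a label" point concrete, here is a sketch of a strict parser for the tokens listed above. The field interpretation is an assumption drawn from the examples: blocked is read as owner plus required action, running as agent id plus UTC start time, and failed as a reason. TaskState and parse_state are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

KINDS = {"ready", "blocked", "running", "done", "failed"}


@dataclass(frozen=True)
class TaskState:
    kind: str
    owner: Optional[str] = None        # blocked: who must act; running: which agent
    detail: Optional[str] = None       # blocked: required action; failed: reason
    started: Optional[datetime] = None  # running only


def parse_state(token: str) -> TaskState:
    """Parse a typed state token; raise ValueError on anything off-protocol."""
    kind, _, rest = token.strip().partition(":")
    if kind not in KINDS:
        raise ValueError(f"unknown state kind: {token!r}")
    if kind in ("ready", "done"):
        if rest:
            raise ValueError(f"{kind} takes no sub-fields: {token!r}")
        return TaskState(kind)
    if kind == "failed":
        if not rest:
            raise ValueError(f"failed needs a reason: {token!r}")
        return TaskState(kind, detail=rest)
    owner, _, tail = rest.partition(":")
    if not owner or not tail:
        raise ValueError(f"{kind} needs two sub-fields: {token!r}")
    if kind == "blocked":
        return TaskState(kind, owner=owner, detail=tail)
    # running: tail is a UTC timestamp such as 2026-04-20T15:25Z
    started = datetime.strptime(tail, "%Y-%m-%dT%H:%MZ").replace(tzinfo=timezone.utc)
    return TaskState(kind, owner=owner, started=started)
```

A free-form status such as "in progress" is rejected outright here, which is the property that lets downstream logic treat the state as structured data.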
3. Single-Writer Concurrency, not Distributed Locks
Rejected: Multi-agent concurrent writes to task-board.md with distributed locking (flock, file-append patterns, CRDT).
Chosen: Strict single-writer protocol: only the main agent writes state. Subagents execute tasks in parallel and return structured results; main agent serializes state updates.
Reasoning: Markdown files are not transactional. Concurrent writes cause lost updates, structural corruption, or unparseable states, and those failures cascade across the system. Single-writer limits parallelism bandwidth but eliminates an entire class of race-condition bugs at zero runtime cost.
Trade-off: Execution parallelism is preserved (up to 3 concurrent subagents); state serialization becomes the coordination bottleneck only if task throughput exceeds ~1/sec sustained. Not a real constraint for this workload.
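
The single-writer protocol can be sketched with a bounded worker pool: workers run tasks and return results, and only the calling function touches the board. This is an illustration in Python threads, not the runtime's actual mechanism. The in-memory dict standing in for task-board.md, the execute callback, and the failed:<ExceptionName> convention are all assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_SUBAGENTS = 3  # from the text: up to 3 concurrent subagents


def run_ready_tasks(board, execute):
    """Run every ready task in parallel and return an updated copy of the board.

    board:   dict mapping task id -> state token
    execute: callable(task_id) -> new state token; may raise on failure

    Workers never see or mutate the board. All state changes are applied
    here, one at a time, by the single writer.
    """
    ready = [tid for tid, state in board.items() if state == "ready"]
    updated = dict(board)
    with ThreadPoolExecutor(max_workers=MAX_SUBAGENTS) as pool:
        futures = {pool.submit(execute, tid): tid for tid in ready}
        for fut in as_completed(futures):
            tid = futures[fut]
            try:
                updated[tid] = fut.result()
            except Exception as exc:
                # Record the failure as state instead of crashing the run.
                updated[tid] = f"failed:{type(exc).__name__}"
    return updated
```

The serialization point is the for-loop over completed futures, which is why throughput of state updates, not task execution, is the only thing this design limits.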
4. Quiescence Gate, not Simple Debounce
Rejected: Fixed 2-second debounce on file watcher events.
Chosen: Watcher debounce (2s) plus a quiescence gate: require 30 seconds of file idle before the executor proceeds. The gate polls every 5s and resets if any change is detected.
Reasoning: A user mid-edit can trigger the watcher 5–10 times during one logical "save." Without quiescence, the executor races the user; state updates collide with in-progress edits. The 30s gate aligns automated processing with the user’s natural pause rhythm.
Cost: 30-second latency floor on processing. Acceptable for asynchronous task flow; not appropriate for real-time chat agents.
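
The gate logic described above (30s idle required, 5s poll, reset on any change) fits in a few lines. The sketch below assumes modification time is the change signal, which the source does not specify. The clock, sleep, and mtime parameters are injectable only so the function can be exercised without real waiting, and wait_for_quiescence is a hypothetical name.

```python
import os
import time


def wait_for_quiescence(path, idle_seconds=30.0, poll_seconds=5.0,
                        clock=time.monotonic, sleep=time.sleep,
                        mtime=os.path.getmtime):
    """Block until path has gone idle_seconds without a modification.

    Returns the modification time that was stable when the gate opened.
    """
    last_seen = mtime(path)
    idle_since = clock()
    while clock() - idle_since < idle_seconds:
        sleep(poll_seconds)
        current = mtime(path)
        if current != last_seen:
            # The file changed during the wait: restart the idle window.
            last_seen = current
            idle_since = clock()
    return last_seen
```

With the default values, an untouched file opens the gate after 30 seconds, and every edit seen by a poll pushes that point 30 seconds further out.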
5. No RAG, No Fine-Tuning — Context Fits in a Prompt
Rejected: Retrieval-augmented generation over task history, or fine-tuning a model on past task decisions.
Chosen: Load the full task-board.md (typically 1–3K tokens) directly into each decision call. No vector DB, no embeddings, no training.
Reasoning: RAG earns its cost when the knowledge base exceeds the context window, or when retrieval precision itself is the product. Neither applies here: the full state is short, cheap to include, and the decision layer needs all active tasks to reason about dependency chains, so partial retrieval would break that. Fine-tuning is the wrong tool for routing logic that changes weekly.
When to revisit: If task history > ~50K tokens sustained, or if cross-workflow queries need semantic search. Until then, the bottleneck isn’t retrieval; it’s the state-machine contract.

Part 4

Measurable Results

Baseline tracking started 2026-04-20. Numbers reflect the first runtime week; some are estimated from initial runs with methodology noted.

~2s · File-change detection · Measured from events.jsonl: time from save to file_changed_triggered event.
30s · Pre-execution gate · Deliberate idle-requirement to prevent partial-edit execution.
~30s · End-to-end task resolution · For deterministic tasks (state routing, status checks). Subagent tasks vary by workload.
<30min · Dead-lock recovery · Self-healing: stuck running:* states auto-reset. Zero manual intervention required.
~$10–30 · Monthly runtime cost · LLM API only. No separate hosting cost (runs on laptop via launchd).
100% · Critical-action approval rate · All destructive / irreversible actions gated behind human-in-the-loop by design, not by bug.
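
The recovery rule behind the <30min figure (running longer than 30 minutes resets to ready) depends on the start timestamp embedded in the running:* token. A sketch, assuming the token layout running:<agent>:<UTC timestamp> shown in decision 2 and an in-memory dict in place of the Markdown file; reset_stale is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=30)  # threshold stated in the routing flow


def reset_stale(board, now, stale_after=STALE_AFTER):
    """Return a copy of board with over-age running:* tasks set back to ready."""
    out = {}
    for tid, state in board.items():
        # maxsplit=2 keeps the timestamp whole even though it contains ':'
        parts = state.split(":", 2)
        if parts[0] == "running" and len(parts) == 3:
            started = datetime.strptime(parts[2], "%Y-%m-%dT%H:%MZ")
            started = started.replace(tzinfo=timezone.utc)
            if now - started > stale_after:
                state = "ready"
        out[tid] = state
    return out
```

Because the timestamp lives inside the state itself, the check needs no separate heartbeat or process table.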

Efficiency Gains (Conservative Estimate)

Before the system: manual task-list check & routine research tasks consumed an estimated 8–12 hours / week. After the system: routine tasks (competitive research, documentation drafting, market scans) are routed to parallel subagents without requiring real-time supervision.

Observed first-week output: 5 substantive research / writing tasks completed without active supervision (representative examples: 260-line competitor analysis, 895-line architectural case study, 18-JD market scan, HR-perspective portfolio review).

Methodology: hours-saved estimates come from a self-reported pre-system baseline. Accurate quantification begins in week 2 (ongoing).


Part 5

Emergent Behavior (The Core Insight)

I programmed the decision layer to do one thing: find tasks marked ready and execute them. Skip everything else.

On the first production run, it did something I didn’t program it to do:

None of these behaviors were in the prompt. They emerged from the state-machine design.

Same LLM. Same task-board. Different agent behavior, depending on whether status is a label or a protocol. A natural-language status field ("waiting", "in progress") gives the model a token to match. A typed state with structured sub-fields gives the model a protocol to reason about. The takeaway I’d package for a product team: structured state, not better prompts, is what turns agents from execution bots into reasoning partners.

This has implications for how agent systems should be designed. Most teams iterate on prompts when agent behavior is inadequate. The higher-leverage move is often to change the type system that the agent operates within.


Part 6

Generalization & Next Steps

Pattern Transfer

The architecture is deliberately not task-automation-specific. The same five-layer pattern (control-plane file + event-driven scheduler + stateless executor + typed state machine + append-only audit log) maps cleanly to several workflow categories listed in a current client brief (Shopify operator) I’m scoping for architecture review:

Workflow | Control Plane | Decision Layer Task
Support ticket triage | Helpdesk ticket queue | Classify + draft reply + flag for review
Pre-order management | Supplier ETA tracker | Detect delta + generate customer notices
Product creation from supplier data | Inbound spreadsheet queue | Extract fields + normalize + push to draft
Collection-buying intake | Inbound inquiry log | Extract + score + draft follow-up

The same state-machine discipline (ready / blocked:* / running:* / done) applies to all of them. The only workflow-specific component is the prompt template in the decision layer. Infrastructure is shared.
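
One way to picture "only the prompt template is workflow-specific" is a shared gate on state plus a per-workflow template lookup. Everything in this sketch is hypothetical: the workflow keys, the template wording, and the build_decision_prompt helper are illustrations of the split, not material from the client brief.

```python
from string import Template

# Workflow-specific part: one prompt template per workflow (illustrative text).
PROMPTS = {
    "support_triage": Template(
        "Classify ticket $item_id, draft a reply, and flag it for review."
    ),
    "preorder": Template(
        "Compare the supplier ETA for $item_id with the last notice "
        "and draft customer updates for any change."
    ),
}


def build_decision_prompt(workflow, item_id, state):
    """Shared part: the same state gate applies to every workflow.

    Returns the prompt for a ready item, or None for any other state.
    """
    if state != "ready":
        return None
    return PROMPTS[workflow].substitute(item_id=item_id)
```

Adding a workflow in this picture means adding one entry to the template table; the gate, scheduler, and log stay untouched.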

Planned Evolution

What I’d Do Differently

Building this system taught me one thing I can take to any agent product: the state machine is the product. Prompts, subagents, scheduling layers — all replaceable. The state contract between human intent and agent action is what persists, scales, and determines whether the system can reason or only execute.


Part 7

Build vs. Buy — Why I Didn’t Use Coze, Dify, or n8n

Scope honesty first: this is a single-operator runtime I built for my own task flow, not a platform product. It solves the specific shape of one person orchestrating a handful of workflows. The question I had to answer before writing any code was not "build vs. launch a competitor to Coze" — it was "does an existing tool already solve this shape, and if so, which?" The analysis below is what I did before deciding to build.

What I was optimising for

What each option would have given me — and why I passed

Coze (ByteDance) — visual agent builder, cloud SaaS
Strengths: Fastest path to a shipped agent for a non-technical team. Rich pre-built tool integrations inside the ByteDance ecosystem.
Why I passed: My state is already Markdown + git. Using Coze means re-creating that state in a canvas UI and accepting vendor lock-in. For a solo operator who lives in the terminal, the canvas is friction, not help.
Dify (open source) — LLM app platform, self-hosted
Strengths: RAG, agent, and dataset management in one place. Good choice for a team building a product that bundles several AI capabilities.
Why I passed: Dify is aimed at shipping a user-facing product; it pays its complexity tax in features I don’t use (multi-tenant datasets, conversation management, embedded UI). For a personal task runtime, Docker + Postgres + Redis + a web UI is overkill.
n8n (open source) — node-graph workflow automation
Strengths: 400+ pre-built integrations. Best choice if the AI step is one node inside a larger multi-SaaS pipeline.
Why I passed: The AI decision isn’t one node in my workflow; it is the routing layer. n8n treats LLMs as yet another integration; my design treats the typed state as the first-class input the LLM reasons about. Different centre of gravity.

The common decision

All three platforms abstract the workflow behind a UI and a schema you configure through that UI. That’s the right call when the operator is non-technical, when state needs to be shared across a team, or when the workflow is the product. None of those apply here. For a solo operator whose workflow state is already text, the cost of introducing a second state system (canvas + vendor storage) outweighs the convenience of pre-built nodes.

Put differently: the reason this runtime is ~300 lines of shell + a skill definition, rather than a deploy of Dify, is not that I think I’d build it better. It’s that my shape (one operator, Markdown-native, terminal-native, zero-vendor) is the shape that these platforms explicitly don’t optimise for. For their actual audience — cross-functional teams shipping user-facing agents — they’re the right choice, and I’d recommend them.

The transferable skill isn’t "I built a better platform." It’s: before writing a line of code, I named the four things I actually needed, mapped each existing tool against them, and picked the build option only when every tool failed a specific criterion. That’s the same build-vs.-buy frame I’d apply to any agent platform decision in a product role — including recommending against a custom build when an existing tool covers the scope.


Siyao Zhang · Available from September 2026, Beijing · Contact