I designed an LLM-powered task orchestration runtime that routes, executes, and self-heals autonomously. A single Markdown file serves as the agent control plane. The runtime is session-independent: edit the file, and the work happens in the background whether or not any session is open. Below is how I designed the architecture, the trade-offs I made, the measurable results, and the emergent behavior I didn’t program.
Context: Running four concurrent tracks (content production, a first client architecture-review brief, portfolio development, coursework) as a solo operator. Task state lived in scattered tools; re-entering context after every session break cost 15–30 minutes of warm-up before any real work.
Product constraint: I wanted a task system where I could add work at any time, and tasks assigned to automation would process themselves without requiring me to be at the machine. Close the laptop, come back hours later, find work done.
Why this matters for agent design: Most "AI assistants" stop when the chat window closes. They’re attached to a session, not to the work. Building an agent runtime that’s independent of session lifecycle is a fundamentally different design problem — and the core shift required for agents to act as infrastructure rather than as tools.
The design question: Can a single plain-text file, paired with a state-machine protocol and an LLM decision layer, serve as a persistent agent runtime? If yes, the same pattern generalizes to any operational workflow — support triage, content pipelines, ops automation.
- task-board.md edited (any source: me, another agent, a cron)
- fswatch · 2s debounce · launchd hourly fallback
- mkdir atomic primitive
- ready / blocked:* / running:* / done / failed:*
- task-board.md writeback · events.jsonl structured log · deadletter/ for terminal failures

Four deliberate trade-offs. Each with a specific alternative I rejected.
- ready / blocked:siyao:send-brief / running:agent-a4b:2026-04-20T15:25Z / done / failed:exhausted_retries
- task-board.md with distributed locking (flock, file-append patterns, CRDT).
- task-board.md (typically 1–3K tokens) directly into each decision call. No vector DB, no embeddings, no training.

Baseline tracking started 2026-04-20. Numbers reflect the first runtime week; some are estimated from initial runs with methodology noted.
- events.jsonl: time from save to file_changed_triggered event.
- running:* states auto-reset. Zero manual intervention required.
- launchd).

Before the system: manual task-list check & routine research tasks consumed an estimated 8–12 hours / week. After the system: routine tasks (competitive research, documentation drafting, market scans) are routed to parallel subagents without requiring real-time supervision.
Observed first-week output: 5 substantive research / writing tasks completed without active supervision (representative examples: 260-line competitor analysis, 895-line architectural case study, 18-JD market scan, HR-perspective portfolio review).
Methodology: hours-saved estimates come from a self-reported pre-baseline. Accurate quantification begins in week 2 (ongoing).
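The save-to-trigger time mentioned in the metrics can only be measured if each event carries timestamps. Here is a minimal Python sketch of that calculation over events.jsonl-style lines. The field names (`event`, `ts`, `saved_at`), the `task_done` event name, and the sample records are assumptions made up for illustration; only `file_changed_triggered` comes from the runtime described here, and the real log schema may differ.

```python
import json
from datetime import datetime
from statistics import median


def trigger_latencies(lines):
    """Return seconds between save and trigger for each file_changed_triggered record."""
    latencies = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        if record.get("event") != "file_changed_triggered":
            continue
        # Hypothetical fields: when the file was saved, and when the trigger fired.
        saved = datetime.fromisoformat(record["saved_at"])
        fired = datetime.fromisoformat(record["ts"])
        latencies.append((fired - saved).total_seconds())
    return latencies


# Made-up sample records, not real measurements.
sample = [
    '{"event": "file_changed_triggered", "saved_at": "2026-04-20T15:25:00+00:00", "ts": "2026-04-20T15:25:02+00:00"}',
    '{"event": "task_done", "ts": "2026-04-20T15:40:00+00:00"}',
    '{"event": "file_changed_triggered", "saved_at": "2026-04-20T16:00:00+00:00", "ts": "2026-04-20T16:00:03+00:00"}',
]

if __name__ == "__main__":
    values = trigger_latencies(sample)
    print(values, median(values))
```

Logging both timestamps at write time is what keeps this a lookup rather than an estimate.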
I programmed the decision layer to do one thing: find tasks marked ready and execute them. Skip everything else.
On the first production run, it did something I didn’t program it to do:
- blocked:task:TASK-015, but TASK-015 had status done. The skill flagged the dependency chain was broken, suggested promoting the task to ready, and asked whether that matched my intent.
- ready: it compared the task’s Measure field against available inputs, noticed a required artifact (portfolio URL) was missing, and refused to execute, deferring to me for clarification.
- running:claude-a:2026-04-20T15:25Z was local time mislabeled as UTC. The skill computed that this timestamp was 2 hours in the future relative to now, flagged it as an anomaly, and asked for clarification.

None of these behaviors were in the prompt. They emerged from the state-machine design.
Same LLM. Same task-board. Different agent behavior, depending on whether status is a label or a protocol. A natural-language status field ("waiting", "in progress") gives the model a token to match. A typed state with structured sub-fields gives the model a protocol to reason about. The takeaway I’d package for a product team: structured state is what turns agents from execution bots into reasoning partners — not better prompts.
This has implications for how agent systems should be designed. Most teams iterate on prompts when agent behavior is inadequate. The higher-leverage move is often to change the type system that the agent operates within.
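To make "status as a protocol" concrete, here is a minimal Python sketch that parses the typed status strings and runs two of the consistency checks described earlier (a blocker that is already done, a running timestamp in the future). It does not show how the skill works: in the runtime the LLM reasons over the board directly. The `Status` class, the function names, the board-as-dict shape, and the task IDs other than TASK-015 are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Status:
    kind: str                      # ready | blocked | running | done | failed
    owner: Optional[str] = None    # blocker type ("task", a person) or running agent
    detail: Optional[str] = None   # blocker reference, start timestamp, or failure reason


def parse_status(raw: str) -> Status:
    """Parse strings such as 'blocked:task:TASK-015' or 'running:agent:2026-04-20T15:25Z'."""
    parts = raw.strip().split(":", 2)  # at most 3 fields; the timestamp keeps its colons
    kind = parts[0]
    if kind in ("ready", "done") and len(parts) == 1:
        return Status(kind)
    if kind == "failed" and len(parts) >= 2:
        return Status(kind, detail=":".join(parts[1:]))
    if kind in ("blocked", "running") and len(parts) == 3:
        return Status(kind, owner=parts[1], detail=parts[2])
    raise ValueError(f"unrecognized status: {raw!r}")


def find_anomalies(board: Dict[str, str], now: datetime) -> List[str]:
    """Flag blockers that are already done and running timestamps in the future."""
    notes = []
    for task_id, raw in board.items():
        status = parse_status(raw)
        if status.kind == "blocked" and status.owner == "task":
            blocker = board.get(status.detail)
            if blocker is not None and parse_status(blocker).kind == "done":
                notes.append(f"{task_id}: blocker {status.detail} is already done")
        if status.kind == "running":
            started = datetime.strptime(status.detail, "%Y-%m-%dT%H:%MZ")
            started = started.replace(tzinfo=timezone.utc)
            if started > now:
                notes.append(f"{task_id}: start time {status.detail} is in the future")
    return notes


# Hypothetical board; only the status strings follow the format used in this write-up.
board = {
    "TASK-015": "done",
    "TASK-016": "blocked:task:TASK-015",
    "TASK-017": "running:claude-a:2026-04-20T15:25Z",
    "TASK-018": "blocked:siyao:send-brief",
}

if __name__ == "__main__":
    now = datetime(2026, 4, 20, 13, 25, tzinfo=timezone.utc)
    for note in find_anomalies(board, now):
        print(note)
```

A free-text label such as "waiting" offers nothing to check against; each typed sub-field here is something either a script or a model can compare with the rest of the board.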
The architecture is deliberately not task-automation-specific. The same five-layer pattern (control-plane file + event-driven scheduler + stateless executor + typed state machine + append-only audit log) maps cleanly to several workflow categories listed in a current client brief (Shopify operator) I’m scoping for architecture review:
| Workflow | Control Plane | Decision Layer Task |
|---|---|---|
| Support ticket triage | Helpdesk ticket queue | Classify + draft reply + flag for review |
| Pre-order management | Supplier ETA tracker | Detect delta + generate customer notices |
| Product creation from supplier data | Inbound spreadsheet queue | Extract fields + normalize + push to draft |
| Collection-buying intake | Inbound inquiry log | Extract + score + draft follow-up |
The same state-machine discipline (ready / blocked:* / running:* / done) applies to all of them. The only workflow-specific component is the prompt template in the decision layer. Infrastructure is shared.
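One piece of that shared infrastructure is the `mkdir` atomic primitive listed in the architecture: directory creation either succeeds or fails as a whole, so only one worker can claim a given task. The sketch below shows the idea in Python rather than the shell the runtime uses. The lock directory layout, the `.lock` suffix, and the function names are assumptions for illustration.

```python
import os
import tempfile


def try_claim(lock_root: str, task_id: str) -> bool:
    """Return True if this caller created the lock directory, False if it already existed."""
    try:
        os.mkdir(os.path.join(lock_root, f"{task_id}.lock"))
        return True
    except FileExistsError:
        return False


def release(lock_root: str, task_id: str) -> None:
    """Remove the lock directory so the task can be claimed again."""
    os.rmdir(os.path.join(lock_root, f"{task_id}.lock"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        print(try_claim(root, "TASK-001"))  # first claimant wins
        print(try_claim(root, "TASK-001"))  # second claimant is refused
        release(root, "TASK-001")
        print(try_claim(root, "TASK-001"))  # claimable again after release
```

Because the claim is a filesystem operation, it works the same way whichever workflow the task belongs to.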
- launchd showing task throughput, error rates, and latency distribution over time.
- events.jsonl are worth more than precise metrics added later; retroactive measurement forces estimation.

Building this system taught me one thing I can take to any agent product: the state machine is the product. Prompts, subagents, scheduling layers: all replaceable. The state contract between human intent and agent action is what persists, scales, and determines whether the system can reason or only execute.
Scope honesty first: this is a single-operator runtime I built for my own task flow, not a platform product. It solves the specific shape of one person orchestrating a handful of workflows. The question I had to answer before writing any code was not "build vs. launch a competitor to Coze" — it was "does an existing tool already solve this shape, and if so, which?" The analysis below is what I did before deciding to build.
- blocked:task:TASK-010 means TASK-010 isn’t done yet; the model can check).

All three platforms abstract the workflow behind a UI and a schema you configure through that UI. That’s the right call when the operator is non-technical, when state needs to be shared across a team, or when the workflow is the product. None of those apply here. For a solo operator whose workflow state is already text, the cost of introducing a second state system (canvas + vendor storage) outweighs the convenience of pre-built nodes.
Put differently: the reason this runtime is ~300 lines of shell + a skill definition, rather than a deploy of Dify, is not that I think I’d build it better. It’s that my shape (one operator, Markdown-native, terminal-native, zero-vendor) is the shape that these platforms explicitly don’t optimise for. For their actual audience — cross-functional teams shipping user-facing agents — they’re the right choice, and I’d recommend them.
The transferable skill isn’t "I built a better platform." It’s: before writing a line of code, I named the four things I actually needed, mapped each existing tool against them, and picked the build option only when every tool failed a specific criterion. That’s the same build-vs.-buy frame I’d apply to any agent platform decision in a product role — including recommending against a custom build when an existing tool covers the scope.
Siyao Zhang · Available from September 2026, Beijing · Contact