Muster

The operating system for agentic software teams — that built itself.

A self-hostable agency OS — Claude-side governance, a Directus shared-task bus, local RAG, and a Next.js portal — went from empty repo to a live, public, HTTPS demo in about 2.5 hours, built by a fleet of ~30 AI subagents that used the system's own task board as their spine, caught their own bugs, and scrubbed their own shipping artifacts.

This page is the build log. The links are the live system.

Read the log ↓ Open the live portal ↗ See the build board ↗ Read the whitepaper GitHub (soon) ↗

~2.5 hrsempty repo → live HTTPS demo

30+AI subagents · 6 parallel fan-outs

2M+tokens consumed

27commits · 90 files · 8 phases

~13defects caught by machines, not humans

0secret leaks · 0 collisions / 28 worktrees

$2–4per month to run

§1 · Abstract

On the evening of June 29, 2026, an empty git repository became a live, internet-reachable, self-hostable software product. The arc from scaffold to public demo took about 2.5 hours of wall-clock, spanned 8 phases, 27 commits and 90 files, and was executed by a governed fleet of roughly 30 Claude subagent invocations consuming over 2 million tokens. It runs on a single AWS instance for about $2–4 per month, and you can open it right now.

That an agent fleet emits files quickly is not the interesting part — volume is unfalsifiable and ages badly. The interesting part is the apparatus that let a long, multi-session, ship-grade build hold together without a human reviewing every diff, and the fact that the system used that apparatus on itself. The thesis: the value is the architecture, not the prompt.

Three costs, kept apart on purpose: hosting (~$3/mo) is not the one-time build model-spend, and neither is the ongoing cost of running the fleet on your next project. The cheap number is only the box.

§2 · Governance is the product

The system built itself.

The system that ships the system used its own task board as the build's spine, its own governed fan-out to construct one of its own subsystems, its own adversarial verifiers to keep its sellable template clean, and its own loop-proofs to catch bugs in its own packaging.

Call this dogfooding, not proof of necessity. Self-use shows the apparatus is coherent and usable — that it can carry its own non-trivial build — not that it is necessary or net-positive in dollars. The honesty is the point.

The single architectural claim this rests on:

The bottleneck is not the model

It is shared coordination state

The task record — not the chat transcript —

is the coordination substrate between a human and a fleet of agents.

The board is still browsable, read-only, exactly as the agents left it. cms.34.220.64.149.sslip.io

Origin clause: Muster began as the internal OS of one agency, Analog Elk. That lineage is now just the first profile (analogelk) shipped alongside a generic one — a demonstrated feature, not the headline.

§3 · What Muster is

Four planes around a task board.

Read it as a transferable mental model — you can build all four on any vendor's stack. The architecture figure:

Governance · Claude-side constitution

A numbered, versioned ruleset (CLAUDE.md §1–§10) loaded into every session: secrets are env-only, one trunk equals production, never edit a contested worktree. Plain-English code that makes a stateless model behave like a long-lived teammate.

Task bus · the Directus os_* board

A shared board every agent reads, claims, updates and closes over Directus's native MCP endpoint. The task record is the coordination atom — claimable units of work.

Memory + RAG · the local engine

On-disk memory files are the knowledge substrate (lessons not derivable from code or git); the append-only log is the narrative substrate (what happened, in order). Three substrates, three cadences.

Portal · the human read-side

A Next.js surface where a human watches the same rows the fleet works. Ships two profiles — generic | analogelk — the detail that makes "it's a product, not an internal tool" true at the code level.

§4 · One receipt, end to end

A paper about receipts should show one.

The RAG service claimed port :9100. A naive health check reported "healthy." It was lying — a different engine, already running, was answering on that port, and the check passed for the wrong reason. Here is the whole trail, on the board:

the symptom

RAG :9100 health check reports green

board row

a task opened for the RAG bring-up — claimed, worked, status tracked

verifier's claim

a co-located engine on the same port was flattering the check; "tests pass" ≠ "the system works"

what caught it

a from-zero clean-room loop-proof, where the impostor process was not present — flipped the row red

closing note

root cause recorded as a co-located listener

fix commit

pin the port + assert identity, not just a 200

The thing that caught it was not a smarter agent. It was a from-zero environment. Every other proof section can stay summary because this one is concrete.

§5 · The build, in 8 phases

17:56 → 20:23 PT · 2026-06-29

The Build-Trail: stacked steps, each a verified phase accreting upward, the top step the live system.

P0Scaffold + envempty repo

P1Compose core topologyDocker

P2Schema + seed · 8-agent fan-out546k tok

P3RAG engine:9100 fix

P4Next.js portalhuman read-side

P5Claude-OS wiring · native MCPenv-token

P6Live demo on the boxarm64 fix, live

P7Packaging · scrub-audit20:23 PT

27 commits · 90 files · the spine of every phase was the os_tasks board — deep-linkable, still live.

§6 · The agent fleet

30+ invocations, 6 fan-outs, 2M+ tokens.

Two named set-pieces carry the number:

The 8-agent schema build — 546k tokens — ground → fan-out → scrub-audit → 3 adversarial leak-verifiers.
The 5-agent live-AWS-pricing cost panel — 256k tokens.

The rest were parallel explorers, builders and verifiers spread across P0–P7, several single-shot. Honestly: not all produced kept work. Some fan-outs returned partial or discarded output that was retried or dropped — and that wasted production is part of the real token cost in §7. The discard rate was not instrumented; that is itself a measurement gap.

§7 · Bugs the machines caught

~7 functional bugs, 6 brand leaks. No human caught any of them.

Defect / risk	What caught it	Needed the fleet?
RAG :9100 port collision	from-scratch loop-proof	No — any disciplined agent
Health check fooled by a co-located engine	clean-room reproduction	No — single-agent reachable
Missing releases.repository_id FK	loop-proof	No
set -e / pipefail bug in doctor	loop-proof	No
arm64-vs-amd64 portal-image mismatch (fixed live)	live deploy	No
Compose RAG-off abort	loop-proof	No
AE seed files colliding with a kept UI collection	scrub-audit	Partly (seam)
6 real AnalogElk references in the sellable template	scrub-audit + 3 adversarial leak-verifiers	Yes — multi-tenant policing

Most functional catches are loop-proof wins a single disciplined agent could reach. The genuinely fleet-dependent win is the system policing its own multi-tenancy before shipping a sellable generic template. Honestly: the brand leaks are string-checkable — a grep catches those too. The fleet's marginal value is the seam and the semantic cases. The scrub is a capability, not a confession.

§8 · Governance & safety

The honesty ledger is the design feature.

0 secret leaks. DB and admin tokens were always env-referenced, never echoed, never committed — grep-verified, not assumed. On the 28 worktrees: we avoided the contested checkout entirely by building read-only from a frozen snapshot. 0 collisions by not participating in the contention, not by handling it.

shipped — demonstrated

aspirational — not yet

self-built in ~2.5 hrs, live + reachable

a measured A/B against the control

0 secret leaks (grep-verified)

real claim atomicity at scale (§9)

0 known-shipped defects as of writing

graceful degradation if the board dies

generic profile scrubbed of origin

decorrelated (cross-model) verifiers

Defects shipped: 0 known-shipped as of writing — not a hard zero. I can count catches; I cannot enumerate misses, so "0 shipped" is survivorship, stated as such.

§9 · Durability across sessions

7+ sessions, zero context loss.

State held across three substrates: the board (coordination spine), on-disk memory files (knowledge), and the append-only log (narrative). A single rolling window cannot physically hold a 2M-token build — it compacts, forgets, re-reads, re-litigates. Externalizing state turned an impossible-for-one-window task into a resumable one. That is the cleanest win, because it is categorical.

The spine has its own coordination problem

The board is shared mutable state and a single point of coupling — the first thing a CTO probes, so here it is straight. A row carries assigned_to and status; the protocol is a compare-and-set. But Directus gives no true row lock by default, and the build observed version collisions across parallel sessions — re-check before write. Contention was actually controlled one level up, by disjoint dispatch plus worktree isolation, not by the board being magically atomic. Scale past human-arbitrated dispatch and you need real claim semantics (optimistic concurrency, or a queue in front of the board). That is not built, and the row does not give it to you for free.

§10 · vs. an ungoverned single-agent workflow

No strawman. A workflow, not a product.

Ungoverned single-agent workflow	Muster
one rolling context window	durable task board + on-disk memory + append-only log (7+ sessions, no loss)
serial work, no coordination substrate	30+ subagents across 6 fan-outs around a shared board
trusts its own output	loop-proofs + adversarial verifiers, with honest attribution of which catches needed the fleet
ad-hoc, no written rules	a written §1–§10 constitution loaded every session

Prior-art, stated so the win isn't rigged: the real alternatives aren't a chat transcript — they're git branches/PRs, markdown plan-files in a repo (this kit uses those too), and framework state (LangGraph / CrewAI / Swarm). Muster's bet is a queryable, multi-writer, cross-session task board as the durable coordination atom — used alongside files, not instead of them.

§11 · Run it yourself

One command. The doctor board is the proof.

The apparatus is open and self-hostable. bin/elk-os is a resumable, idempotent phase machine whose only hard dependency is Docker.

# scaffold → live, from an empty volume
bin/elk-os init      # scaffold + env
bin/elk-os up        # compose core (Directus, Postgres, RAG, portal)
bin/elk-os migrate   # apply os_* schema
bin/elk-os seed      # profile data (generic | analogelk)
bin/elk-os wire      # enable native MCP, mint env-referenced token
bin/elk-os doctor    # green/red from-scratch acceptance board

The exhibits — the only outbound destinations that matter. Anything ember is a seam into the running system:

The running product ↗ the portal · generic profile · view the Analog Elk case study from inside
archived snapshot (P6 live capture)

The build board ↗ the actual task spine the agents used — public, read-only; browse the receipts, you can't write to them
archived snapshot

Read the whitepaper → GitHub repo — opening soon ↗

TLS note: the demo runs on a sslip.io host with a real, browser-trusted Let's Encrypt certificate, auto-provisioned by Caddy on a $3/mo box. The live instance may sleep between showings to save cost — archived snapshots below stand in.

§12 · Author & judgment calls

Mike Walliser

Creative technologist and AI-systems builder. The system is the protagonist; the architect is in the margins.

Here is what no agent decided. The reframe — "this is a product; Analog Elk is just the case study" — was a product call. The honesty ledger exists because a human insisted on it and defined what counts as shipped; agents optimize toward "done," and only a human reliably separates done from claimed-done. Bounding the fan-out — "this is enough verification" — is human ROI governance; the machine fans out forever. The arm64/amd64 fix happened live on metal, human-diagnosed. Mike designed this page and the system architecture — the page's craft is the portfolio piece.

n=1, self-built, author-run, same-base-model verifiers — these limitations are trust assets, stated plainly throughout. Read it as an existence proof and a reasoned playbook, not a controlled study.