April 22, 2026

Introducing Runbook

I built runbook because of a specific annoyance I kept hitting with AI coding agents. You tell them the workflow. Here’s the build command, here’s the test command, here’s the lint command. You write it in CLAUDE.md. You set up skills. Next time the agent needs to run tests, it still invents its own command. Sometimes wrapped in grep and sed to keep the token count down. Sometimes with the wrong flags. Sometimes in the wrong directory. Every invocation, a different command.

The root issue is that context isn’t constraint. Even when the right command is documented in the context window, the model is still making a fresh decision every time it needs to run something. It doesn’t remember that two minutes ago it ran npm run test:e2e successfully. It decides again, and maybe this time it’s npx vitest piped through grep for “FAIL”. Same task, different command, unpredictable result. The documentation helped sometimes, but it didn’t fix the underlying behavior.

How runbook works

The premise is simple: if you don’t want the model improvising the command, don’t ask it to choose the command. Give it a deterministic tool that runs the command for it. Runbook is an MCP server. You define your project’s workflow commands in a YAML file, add runbook as an MCP server in your agentic coding tool (mine’s Claude Code), and each task becomes an MCP tool the agent can call. The agent isn’t being convinced to use the right command. The right command is the only thing the tool can do.

A minimal config looks like this:

version: "1.0"

tasks:
  build:
    description: "Build the project"
    command: "npm run build"
    type: oneshot

  test:
    description: "Run tests"
    command: "npm run test"
    type: oneshot

  dev:
    description: "Start the dev server"
    command: "npm run dev"
    type: daemon

Drop that at .runbook/tasks.yaml, add runbook to .mcp.json, and the agent gets run_build, run_test, start_dev, stop_dev, status_dev, and logs_dev as tools.
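To make the naming pattern concrete, here is a minimal Python sketch of how the task list above could map to tool names. The pattern itself (run_ for oneshot tasks; start_, stop_, status_, and logs_ for daemons) comes from the example above. The function and its data layout are hypothetical and are not runbook’s actual code.

```python
# Hypothetical sketch: derive MCP tool names from task definitions.
# Only the naming pattern is taken from the post; everything else is illustrative.

TASKS = {
    "build": {"command": "npm run build", "type": "oneshot"},
    "test": {"command": "npm run test", "type": "oneshot"},
    "dev": {"command": "npm run dev", "type": "daemon"},
}

# A daemon task gets one tool per lifecycle verb.
DAEMON_VERBS = ("start", "stop", "status", "logs")


def tool_names(tasks):
    """Return the tool names a task mapping would expose, in definition order."""
    names = []
    for name, spec in tasks.items():
        if spec["type"] == "oneshot":
            names.append(f"run_{name}")
        elif spec["type"] == "daemon":
            names.extend(f"{verb}_{name}" for verb in DAEMON_VERBS)
        else:
            raise ValueError(f"unknown task type for {name}: {spec['type']}")
    return names
```

Calling tool_names(TASKS) yields the six names listed above. Note that nothing in a tool name carries the command string, so the agent has no way to vary it.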

What matters most

Three features matter most in practice.

The first is that long-running task output never enters the context window unless the agent asks for it. When runbook executes a task, it writes the full output to a session log on disk and returns the agent a summary plus a session ID. A five-thousand-line build log doesn’t cost the agent five thousand lines of context. If it needs to look at something specific, it calls read_session_log with the session ID and a regex filter, and gets back only the matching lines.
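The idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not runbook’s implementation: the log directory, the shape of the summary, and the run_task helper are all hypothetical. read_session_log borrows the tool name from the post, but its signature here is a guess.

```python
import re
import subprocess
import sys
import tempfile
import uuid
from pathlib import Path

# Hypothetical log location; the post does not say where runbook keeps session logs.
LOG_DIR = Path(tempfile.gettempdir()) / "runbook-sketch-logs"

# A stand-in for a noisy build: three lines of output, one of them a failure.
DEMO_COMMAND = [sys.executable, "-c", "print('ok 1'); print('FAIL 2'); print('ok 3')"]


def run_task(command):
    """Run a command, write its full output to disk, and return only a summary."""
    session_id = uuid.uuid4().hex
    LOG_DIR.mkdir(parents=True, exist_ok=True)
    result = subprocess.run(
        command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    (LOG_DIR / f"{session_id}.log").write_text(result.stdout)
    return {
        "session_id": session_id,
        "exit_code": result.returncode,
        "line_count": len(result.stdout.splitlines()),
    }


def read_session_log(session_id, pattern):
    """Return only the log lines that match the regex filter."""
    regex = re.compile(pattern)
    text = (LOG_DIR / f"{session_id}.log").read_text()
    return [line for line in text.splitlines() if regex.search(line)]
```

With this shape, run_task(DEMO_COMMAND) hands back a small dictionary no matter how long the output was, and read_session_log(summary["session_id"], "FAIL") pulls back only the lines worth reading.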

The second is daemon supervision with cross-session clustering. For long-running services like dev servers and previews, runbook tracks each daemon’s PID and metadata on disk. If I open a second Claude Code session on the same project, it sees the dev server is already running and can query its status, read its logs, or stop it. The daemon isn’t owned by any one session. It’s owned by the project. Two sessions can work in parallel and both know what’s already up.
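A rough Python sketch of why on-disk state makes this work: any process that can read the project directory can answer “is the dev server up?” without having started it. The file layout, field names, and helper functions are assumptions made for illustration, and the liveness probe is POSIX-only. Runbook’s real bookkeeping may look quite different.

```python
import json
import os
from pathlib import Path


def state_path(project_dir, task):
    """Hypothetical layout: one JSON file per daemon, stored inside the project."""
    return Path(project_dir) / ".runbook-sketch" / f"{task}.json"


def record_daemon(project_dir, task, pid, command):
    """Persist the daemon's PID and metadata so other sessions can find it."""
    path = state_path(project_dir, task)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"pid": pid, "command": command}))


def is_alive(pid):
    """POSIX-only probe: signal 0 checks that the process exists without signalling it."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def daemon_status(project_dir, task):
    """Answer from disk alone, so a second session sees what the first one started."""
    path = state_path(project_dir, task)
    if not path.exists():
        return {"running": False}
    meta = json.loads(path.read_text())
    return {"running": is_alive(meta["pid"]), **meta}
```

Because the state is keyed by project directory rather than by session, the session that calls daemon_status does not have to be the one that called record_daemon.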

The third is prompts. Runbook lets you define workflow prompts in YAML that template-reference your task tool names. So instead of telling the agent “run lint, then test, then build” in a CLAUDE.md, you write the workflow once as a prompt and it gets exposed through MCP. The prompt resolves the task names at the moment of use, so renaming a task doesn’t break documentation that lives somewhere else. A small example:

prompts:
  ship-check:
    description: "Pre-deploy checks"
    content: |
      Run these in order, stop on first failure:
      1. {{run_task "lint"}}
      2. {{run_task "typecheck"}}
      3. {{run_task "test"}}
      4. {{run_task "build"}}

{{run_task "lint"}} resolves to run_lint, the actual tool name. The workflow lives where the project lives. New sessions see it without me re-typing anything.
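The resolution step can be approximated with a small regex substitution. This Python sketch is hypothetical: it handles only the {{run_task "name"}} form shown above and assumes every referenced task is a oneshot task that maps to run_<name>. Failing on an unknown task name is my assumption, since the post does not say how runbook handles that case.

```python
import re

# Matches {{run_task "name"}} with optional whitespace inside the braces.
RUN_TASK = re.compile(r'\{\{\s*run_task\s+"([^"]+)"\s*\}\}')


def resolve_prompt(content, tasks):
    """Replace each run_task reference with the tool name for that task."""

    def replace(match):
        name = match.group(1)
        if name not in tasks:
            # Assumed behavior: fail loudly instead of emitting a dangling tool name.
            raise KeyError(f"prompt references unknown task: {name}")
        return f"run_{name}"

    return RUN_TASK.sub(replace, content)
```

Since the substitution happens when the prompt is requested, a renamed task shows up as an error at that point rather than as stale instructions in a document somewhere else.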

Smaller details

A few smaller details are worth calling out. The same runbook binary is a CLI, so I can run runbook list or runbook logs dev --filter=ERROR from my terminal and get the same view of state the agent has. The refresh_config MCP tool hot-reloads the manifest, so I can edit a YAML file and have the agent pick up new or changed tools without restarting the agentic coding tool. Configs can be split across multiple YAML files in a .runbook/ directory (tasks in one file, daemons in another, prompts in a third), and runbook merges them. Task groups and the task dependency graph are exposed as MCP resources so agents can introspect the workflow structure when they need to.
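As an illustration of the multi-file merge, here is a Python sketch that operates on configs that have already been parsed into dictionaries. The post does not describe how runbook treats duplicate names or repeated scalar keys, so both choices below (reject duplicates, keep the first scalar) are assumptions rather than documented behavior.

```python
def merge_configs(configs):
    """Merge parsed config files section by section, rejecting duplicate names."""
    merged = {}
    for config in configs:
        for section, value in config.items():
            if not isinstance(value, dict):
                # Scalars such as "version": keep the first one seen (assumed rule).
                merged.setdefault(section, value)
                continue
            target = merged.setdefault(section, {})
            for name, spec in value.items():
                if name in target:
                    # Assumed rule: two files defining the same entry is an error.
                    raise ValueError(f"duplicate {section} entry: {name}")
                target[name] = spec
    return merged
```

Rejecting duplicates keeps the “one task, one command” property intact: if two files could silently redefine the same task, the tool’s behavior would depend on file load order.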

How I use it

In practice I usually disable the Bash tool entirely when runbook is wired up. If the agent needs to do something that isn’t already a task, it adds the task to the runbook YAML and calls refresh_config to register the new tool. The new task is what it uses from then on.

The combined effect is that I stopped re-explaining commands. The agent stopped silently changing the invocation between runs. Long builds and test runs stopped costing context. And background services behave the same way whether I come back to them thirty seconds later in the same session or an hour later in a new one.

The underlying point is that you can reduce how much the AI has to guess by giving it fewer degrees of freedom at the places where guessing costs you most. Command invocation is one of those places. The model is great at reasoning about what to do next. It’s less good at remembering that the test command on this project is not pytest because the project isn’t Python. Take the choice away. Give it a tool. The tool does the right thing because you wrote it to.

Source: https://github.com/launchcg/runbook