The Loop Engineering Blueprint

A loop is not a cron job

Both run on a schedule. The difference is the decision-maker inside. A cron job runs a fixed script. A loop runs an agent that reads the current state, chooses the next action, does it, checks the result, and decides what to do next.

For about two years the workflow was high-effort ping-pong: write a prompt, wait, skim the output, adjust, send the next one. Your attention was the engine — step away from the keyboard and the agent froze mid-turn. Loop engineering is the move where you stop being the engine. You write a small program — a shell loop, a hosted task, a few hundred lines of code — and that program is what talks to the model.

The shift in one line A good prompt fixes one turn. It says nothing about which turn comes next, what counts as done, who signs off before an irreversible action, or how spend stays capped. The loop is everything around the prompt.

The inner loop and the outer loop

Inner loop — reason → act → observe → repeat. An agent that can only suggest isn't running a loop. One that can run code, read the result, fix it, and run again is.

Outer loop — the control system you design around it: the trigger that starts a run, the verifier that grades it, the memory that survives between runs, and the stop rules that decide when it quits.

When should you actually build one?

If the task is…	Do this
A one-off	Just prompt the model. A loop is overhead you don't need.
Repeating + clear pass/fail	Build a loop. This is the sweet spot — verification is possible and the payoff compounds.
Vague, no crisp "done"	Don't hand it to a while-loop. Define the goal first, then decide.

Rule of thumb Start simple. Add autonomy only when it pays for itself. Prompt engineering isn't dead — the leverage point just moved up a level, from the phrasing of one turn to the design of the system that runs many.

The six building blocks

Every loop that survives production solves the same responsibilities. Tools name them differently, but the parts line up. Two of these — verification and memory — are what separate a loop you can trust from one that just makes mistakes faster.

01 · TRIGGER

Automation

Something starts a run that isn't you typing — a schedule, a webhook, an event. Must come bundled with a stop condition.

/loop · /schedule · headless run -p

02 · ISOLATION

Worktrees

Give each agent its own checkout so parallel runs don't overwrite each other. Skip it if you only ever run one agent.

git worktree per sub-agent

03 · KNOWLEDGE

Skills

A markdown file holding the context you keep re-explaining — structure, rules, formats. Loaded on demand, written once.

SKILL.md · matched by its description

04 · REACH

Connectors

Tools the agent can touch — APIs, databases, inbox, task board, MCP servers. A loop locked to local files is a glorified script.

MCP · REST · filesystem

05 · TRUST

Sub-agents (maker–checker)

The agent that does the work must not be the one grading it. A separate reviewer — often on a leaner model — checks it against the rules.

builder proposes · reviewer disposes

06 · MEMORY

The state spine

On-disk state that survives runs, crashes, and context resets. Read at the start of every run, appended at the end.

markdown · JSON · SQLite · board

Why those two are highlighted The maker–checker split is what makes a loop safe to run unattended — the grader doesn't have to be smarter than the builder, just different, with its own rubric. And the memory spine is how a correction sticks: when the agent repeats a mistake, have it write the lesson into the state file so every future run inherits the fix.

Build your first loop

Five copy-paste assets, in order. Fill the [brackets], run the charter once while you watch, then wire the pieces around it. Every block below has a working Copy button.

Step 1 — The loop charter (paste this first)

This is the prompt that turns a one-off into a repeatable, self-checking unit of work. It is deliberately strict about what "done" means and how many items to touch per run.

charter.txt — paste into your agent

# LOOP CHARTER
ROLE: You run one pass of a maintenance loop. You are not chatting.

FIND: Look at [source: failing tests / inbox / board column].
      Pick at most [N = 1] item(s) to work on this run.
      If there is nothing to do, output "NOTHING_TO_DO" and stop.

DO:   Make the smallest change that resolves the item.
      Touch only files related to it.

CHECK: The item is done ONLY when [tests in path/ pass AND lint is clean].
       Run the check. Do not claim success without running it.

RECORD: Append one line to [STATE.md]: what you did + the check result.
        If you hit the same failure twice, write the lesson to [RULES.md].

STOP: After [N] item(s), or if the check fails twice on the same item.

Step 2 — A goal condition that actually stops

A goal-conditioned loop runs across turns until a condition is true. The key: a separate, faster model grades the condition after each turn — the writer doesn't get to mark its own homework. Write the condition so a grader can verify it from the run output, not from a vibe.

goal — set the stop condition

/goal all tests under test/auth pass and `npm run lint` exits 0
      and STATE.md has a new line for every item processed

4 constraints that silently break a goal 1. The evaluator reads the transcript, not your files — make the proof visible in the run output. 2. The condition must be objectively checkable (an exit code, a test result). 3. The stop signal must be unambiguous or the loop never stops. 4. Cap spend separately — a bad condition burns tokens until you notice.

Step 3 — Split the maker from the checker

Two agent definitions. The builder proposes a change; the reviewer independently grades it on a leaner model with its own rubric. This single split is what lets you trust a loop running unattended.

agents/reviewer.md

---
name: reviewer
model: [a fast, cheaper model]
description: Independently grades the builder's change. Never edits code.
---

You are the checker, not the author. Grade the change against:
  1. Does it pass the stated CHECK condition? Run it yourself.
  2. Does it touch only files relevant to the item?
  3. Does it violate any rule in RULES.md?

Output exactly one of:
  PASS  — with the check output that proves it
  FAIL  — with the single reason and the rule or test it broke

Step 4 — The memory spine

Boring on purpose. A flat file the loop reads at the start and appends at the end. This is how a run survives a crash, a context reset, or a 3 a.m. process kill.

STATE.md — read at start, append at end

# LOOP STATE — append-only log

## Done
- 2026-06-27 14:02  fixed test/auth/login.spec  → PASS (12/12)
- 2026-06-27 14:09  refactored token refresh    → PASS (lint clean)

## Open
- test/auth/session.spec  → FAILED twice, escalated to human

## Lessons (mirror critical ones into RULES.md)
- Never edit generated/ — it is overwritten on build.

Step 5 — The reference loop (the whole thing)

The canonical blueprint people cite: morning triage → merged PR. Design it once, never prompt it again. Note the budget cap — it is independent of the goal and non-negotiable.

loop.py — the outer loop, in any language

while not goal_met:
    state   = read_memory("STATE.md")      # survives restarts
    work    = find_next_item(state)         # inbox · board · failing tests
    if work is None:
        break

    branch  = new_worktree()                # isolate this unit of work
    result  = builder(work, skills, connectors)
    verdict = reviewer(result, rubric, tests)   # different agent · leaner model

    if verdict.passed:
        commit(branch); open_pr(branch)
        append_memory(state, done=work)
    else:
        append_memory(state, lesson=verdict.reason)   # write the lesson back

    if spend_so_far > budget_cap:           # the line that saves you money
        halt("budget cap hit")

What it looks like running

A single unattended pass. The builder makes a change, the reviewer grades it independently, state gets appended, and the budget guard watches the whole time.

loop — pass 7 of N · unattended

loop> find_work() → test/auth/session.spec failing (1 item) builder> editing src/auth/session.ts … running checks builder> tests 12/12 pass · lint clean reviewer> independent grade on fast model … reviewer> PASS — touched only session.ts, no rule broken loop> commit a1f9c2 · opened PR #284 loop> append STATE.md ✓ guard> spend 0.42 / 2.00 cap — ok, continuing loop> find_work() → NOTHING_TO_DO · goal met, exiting

Token economics A self-checking, retrying loop runs multiple turns per item, so cost compounds far faster than a single prompt. Start with small batches (that's what the N in the charter is for), lean on prompt caching, keep context tight, and always set the hard cap.

The 30-day rollout

Don't build the full system on day one. Earn each layer of autonomy before you add the next.

Days 1–5

Run it as a plain prompt

Pick one annoying, repeatable, pass/fail task you do by hand. Paste the charter, fill the brackets, run it once while you watch.

Days 6–12

Add a goal that stops

Give it an objectively checkable stop condition. Confirm a grader can verify "done" from the output. Run it once, watching.

Days 13–20

Split maker from checker

Add the builder + reviewer sub-agents and a memory file. Now the loop checks its own work and remembers across runs.

Days 21–25

Isolate and connect

Add worktrees so parallel runs don't collide, and connectors so "find the work" can reach your real inbox, board, or repo.

Days 26–30

Schedule, cap, and let go

Put it on a schedule, set the hard budget cap, and let it run unattended — spot-checking the traces daily, not babysitting every turn.

Where this stops being a terminal trick The moment a loop must run at 3 a.m. with no session open, survive a crash mid-run, pause for days waiting on human approval, or give each agent its own machine — you've outgrown the TUI and you're in runtime territory: agent servers, durable execution, and isolated sandboxes.

Stop prompting.
Start engineering loops.