Evaluating & optimizing

This is the loop Jetty is built around. A runbook carries its own definition of success alongside the job description, and every run is scored against it. Once you have a score, you can climb it: tighten a prompt, reorder tools, raise a threshold, re-run, and check whether the number went up. The runbook stops being a static script and starts converging on what you wanted.

Evals live in the runbook spec

There is no separate eval framework bolted on the side. The pass/fail criteria, judge prompts and rubrics, and an optional golden dataset all live in the same markdown file as the job itself, so the work and the test that grades it are versioned together. See writing runbooks for exactly where each piece goes.

Two styles of eval

Most runbooks use one or both. They answer different questions, so reach for the one that matches what “good” actually means for your task.

Programmatic: deterministic Python checks. Did the expected files get written? Are the required fields present in the output? Is a number above a threshold? These are cheap and fast, and they never flake. Use them for anything you can assert in code.
Rubric: LLM-as-judge. The simple_judge step reads the output and scores it 1–5 across named dimensions, passing only if the total and the per-dimension thresholds are met. Use this for quality you can describe but can't assert: tone, completeness, whether an answer actually addresses the question.
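As a rough illustration of the programmatic style, here is a minimal sketch of the three checks named above. Everything in it is an assumption made for the example: the function name check_run, the result.json file name, and the score field are hypothetical, not part of Jetty's runbook spec.

```python
import json
from pathlib import Path


def check_run(output_dir, expected_files, required_fields, min_score):
    """Return (passed, failures) for one run's output directory.

    Hypothetical helper: the file name "result.json" and the "score"
    field are assumptions for this sketch, not a Jetty convention.
    """
    out = Path(output_dir)
    failures = []

    # 1. Did the expected files get written?
    for name in expected_files:
        if not (out / name).is_file():
            failures.append(f"missing file: {name}")

    # Load the structured output, tolerating a missing or malformed file.
    result = {}
    result_path = out / "result.json"
    if result_path.is_file():
        try:
            result = json.loads(result_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            failures.append("result.json is not valid JSON")
        if not isinstance(result, dict):
            failures.append("result.json is not a JSON object")
            result = {}

    # 2. Are the required fields present in the output?
    for field in required_fields:
        if field not in result:
            failures.append(f"missing field: {field}")

    # 3. Is the number at or above the threshold?
    score = result.get("score")
    if not isinstance(score, (int, float)) or score < min_score:
        failures.append(f"score {score!r} is below {min_score}")

    return (not failures, failures)
```

Returning the list of failures alongside the boolean is a design choice for the sketch: a failed run then says which assertion broke, which is the kind of detail a later optimization pass can use.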

A strong runbook usually layers both: a programmatic gate to catch the obviously-broken runs, then a rubric to grade the ones that pass.
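The layering can be sketched as follows. This is not how simple_judge is implemented; it only illustrates the pass rule described above (total and per-dimension thresholds must both be met) and the ordering of gate before judge. The names rubric_passes and evaluate, and the gate and judge callables, are hypothetical.

```python
def rubric_passes(scores, min_total, min_per_dimension):
    """Pass only if the total AND every per-dimension threshold are met.

    scores: dict mapping a dimension name to a judge score from 1 to 5.
    min_per_dimension: dict mapping a dimension name to its minimum score.
    """
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} score {value} is outside the 1-5 range")
    total_ok = sum(scores.values()) >= min_total
    # A dimension the judge did not score counts as 0, so it fails its floor.
    dims_ok = all(
        scores.get(name, 0) >= floor for name, floor in min_per_dimension.items()
    )
    return total_ok and dims_ok


def evaluate(output, gate, judge, min_total, min_per_dimension):
    """Run the cheap programmatic gate first; call the judge only if it passes."""
    if not gate(output):
        return {"passed": False, "stage": "programmatic", "scores": None}
    scores = judge(output)
    passed = rubric_passes(scores, min_total, min_per_dimension)
    return {"passed": passed, "stage": "rubric", "scores": scores}
```

Because the gate runs first, an obviously broken run never reaches the judge, so the slower and costlier rubric step is spent only on outputs worth grading.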

Every run gets a score

Run the runbook and the eval result — the pass rate or the rubric score — is recorded on the trajectory. You can label runs to slice them later (by config, by model, by dataset split). That recorded score is the thing the optimization loop reads, so the loop only works if your evals are in the spec.
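To make the idea of slicing by label concrete, here is a small sketch that computes a pass rate per label value. The record shape (a dict with "labels" and "passed" keys) is an assumption for the example and is not Jetty's trajectory format; the sample runs are made up purely to show the call.

```python
from collections import defaultdict


def pass_rate_by_label(runs, label_key):
    """Group runs by one label and return {label value: pass rate}.

    runs: list of dicts shaped like {"labels": {...}, "passed": bool}.
    This record shape is hypothetical, chosen for the sketch.
    """
    counts = defaultdict(lambda: [0, 0])  # label value -> [passed, total]
    for run in runs:
        value = run["labels"].get(label_key, "(unlabeled)")
        counts[value][0] += int(run["passed"])
        counts[value][1] += 1
    return {value: passed / total for value, (passed, total) in counts.items()}


# Illustrative, made-up runs labeled by model and dataset split.
runs = [
    {"labels": {"model": "model-a", "split": "dev"}, "passed": True},
    {"labels": {"model": "model-a", "split": "test"}, "passed": False},
    {"labels": {"model": "model-b", "split": "dev"}, "passed": True},
    {"labels": {"model": "model-b", "split": "test"}, "passed": True},
]
by_model = pass_rate_by_label(runs, "model")
by_split = pass_rate_by_label(runs, "split")
```

The same records answer two different questions depending on the label you group by, which is why labeling at run time is worth the small effort.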

The optimize loop

/optimize-runbook is a skill in the Claude Code plugin. Point it at a runbook and it:

reads the prior trajectories for that runbook,
identifies the failure patterns — where the score drops, and why,
proposes targeted edits to the runbook (not a rewrite — surgical changes), and
re-runs the evals to confirm the score actually climbed.

That last step matters. The loop doesn't hand you a guess; it hands you a change it has already verified against the same bar. If the edit didn't help, you find out before you keep it.
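The verify-before-keep behavior can be sketched as a simple accept-if-better loop. This is an assumption-laden illustration, not the implementation of /optimize-runbook: propose_edit and run_evals are hypothetical callables standing in for "propose a targeted edit" and "re-run the evals and return a score".

```python
def optimize(runbook, propose_edit, run_evals, max_cycles=3):
    """Keep an edit only if the re-run score beats the current best.

    propose_edit(runbook, score) returns a candidate runbook or None.
    run_evals(runbook) returns a numeric score (higher is better).
    Both callables are hypothetical stand-ins for this sketch.
    """
    best_score = run_evals(runbook)
    history = [best_score]
    for _ in range(max_cycles):
        candidate = propose_edit(runbook, best_score)
        if candidate is None:  # nothing left to try
            break
        score = run_evals(candidate)
        if score > best_score:  # verified against the same evals
            runbook, best_score = candidate, score
        # Otherwise the candidate is discarded and the runbook is unchanged.
        history.append(best_score)
    return runbook, best_score, history
```

Since a candidate is only adopted when its measured score is strictly higher, the recorded best score never decreases from one cycle to the next.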

What climbing looks like

Concretely: a runbook might start at a ~62% eval pass rate. One cycle tightens a vague prompt. The next reorders the tools so the agent reads context before it acts. A third nudges a threshold that was too lenient. After a few of those, the same runbook is passing ~84% on the same evals, with no model swap and no new infrastructure, just the spec converging on what you actually wanted.

Keep the judge independent of the agent under test. When a weak model grades its own output, it tends to score its own mistakes as successes, so the loop climbs a lie. Use a separate, capable model (or a deterministic check) for grading, and the score you climb is one you can trust.
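One way to enforce this is a small guard that runs before any grading. The sketch below is hypothetical: the function name and the convention that judge_model=None means a deterministic check are assumptions, and comparing model names only catches the exact self-grading case, not a judge that is merely too weak.

```python
def check_judge_independence(agent_model, judge_model):
    """Raise if the judge is the same model as the agent under test.

    judge_model=None stands for a deterministic (programmatic) check,
    which is always independent. Hypothetical helper for illustration.
    """
    if judge_model is None:
        return
    if judge_model.strip().lower() == agent_model.strip().lower():
        raise ValueError(
            f"judge model {judge_model!r} is the same as the agent under test; "
            "use a separate model or a deterministic check"
        )
```

Failing loudly at configuration time is cheaper than discovering after several cycles that the score being climbed was self-reported.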

Want the framework-agnostic version of this loop (A/B one agent against another and flag regressions, with the agent running outside Jetty)? See bring your own framework. To author the evals themselves, see writing runbooks.