Trajectories & evaluation

Every time Jetty runs something — a passthrough model call, a workflow, or a runbook — it records a trajectory: a complete, replayable trace of what happened. This is the observability layer, and it's also the raw material for improving a runbook over time.

What a trajectory captures

A trajectory is the run, frozen for inspection. For each step it records:

the activity that ran and how long it took,
the inputs it received (resolved from path expressions),
the outputs it produced, and any files written to storage,
status (running, completed, failed, cancelled) and metadata.

Trajectories are immutable historical records. You can replay one in the app's trajectory viewer, or fetch it from code with getTrajectory (SDK) or the get-trajectory tool (MCP). When you need to find runs, list them and filter by status — see the API reference.
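To make the shape concrete, here is a minimal sketch of a trajectory as an immutable record, plus a status filter. The class names, field names, and the `filter_by_status` helper are hypothetical and written for illustration; they are not Jetty's schema or SDK, which the API reference documents.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Tuple

# Statuses named in the text above.
STATUSES = ("running", "completed", "failed", "cancelled")


@dataclass(frozen=True)  # frozen: attributes cannot be reassigned after creation
class Step:
    activity: str                 # the activity that ran
    duration_ms: int              # how long it took
    inputs: Mapping[str, Any]     # inputs after path expressions were resolved
    outputs: Mapping[str, Any]    # outputs the step produced
    files: Tuple[str, ...] = ()   # storage paths of any files written
    status: str = "completed"


@dataclass(frozen=True)
class Trajectory:
    id: str
    status: str
    steps: Tuple[Step, ...] = ()


def filter_by_status(trajectories, status):
    """Return the trajectories whose status matches, preserving order."""
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    return [t for t in trajectories if t.status == status]


# Hypothetical sample data.
runs = [
    Trajectory("t1", "completed",
               (Step("fetch", 120, {"url": "https://example.com"}, {"bytes": 2048}),)),
    Trajectory("t2", "failed",
               (Step("parse", 15, {"bytes": 2048}, {}, status="failed"),)),
]
failed_ids = [t.id for t in filter_by_status(runs, "failed")]
```

Freezing the dataclasses mirrors the "immutable historical record" idea: assigning to `runs[0].status` raises an error instead of silently rewriting history.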

Labels

A label is a key/value tag you attach to a trajectory, for example config: warm, graded: pass, or cohort: 2026-06. Labels are how you group runs and carry eval results: a grading step (or an A/B harness) writes the verdict as a label, and you query it later. Add one with addLabel (SDK) or the add-label tool (MCP).
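As a rough illustration of the write-then-query pattern, the sketch below keeps labels in an in-memory dictionary. The `add_label` and `find_by_label` functions are hypothetical stand-ins, not the SDK's addLabel call, and the overwrite-on-same-key behaviour is an assumption.

```python
from collections import defaultdict

# Hypothetical in-memory store: trajectory id -> {label key: label value}.
labels = defaultdict(dict)


def add_label(trajectory_id, key, value):
    """Attach a key/value label; assumes a repeated key overwrites the old value."""
    labels[trajectory_id][key] = value


def find_by_label(key, value):
    """Return the sorted ids of trajectories whose label `key` equals `value`."""
    return sorted(tid for tid, kv in labels.items() if kv.get(key) == value)


# A grading step writes its verdict as a label...
add_label("traj-1", "config", "warm")
add_label("traj-1", "graded", "pass")
add_label("traj-1", "cohort", "2026-06")
add_label("traj-2", "config", "warm")
add_label("traj-2", "graded", "fail")

# ...and a later query groups runs by it.
warm_runs = find_by_label("config", "warm")
warm_passed = [tid for tid in warm_runs if labels[tid].get("graded") == "pass"]
```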

Evaluation: turning runs into a score

An eval answers one question about a run: did it actually work? In Jetty, evals are declared inside the runbook itself, so the standard travels with the artifact. There are two styles:

Programmatic. Deterministic checks: do the required files exist, are the fields present, is the number above a threshold. Cheap, repeatable, no model needed.
Rubric (LLM-as-judge). A second model scores the output against named dimensions. For example, the simple_judge step scores 1–5 on accuracy, tone, and completeness, and passes a run only when the scores clear set thresholds. Use this for quality you can't assert with an equality check.
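The programmatic style can be sketched with nothing but the standard library. The file names, field names, and threshold below are hypothetical, and this is plain Python rather than Jetty's eval declaration syntax; it only shows the three kinds of check the text names.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_FILES = ["report.json", "summary.md"]  # hypothetical output files
REQUIRED_FIELDS = ["title", "score"]            # hypothetical report fields
MIN_SCORE = 0.8                                 # hypothetical threshold


def programmatic_eval(out_dir: Path) -> list:
    """Return a list of failure reasons; an empty list means the run passed."""
    failures = [f"missing file: {name}" for name in REQUIRED_FILES
                if not (out_dir / name).is_file()]
    report_path = out_dir / "report.json"
    if not report_path.is_file():
        return failures  # field checks are impossible without the report
    report = json.loads(report_path.read_text())
    failures += [f"missing field: {name}" for name in REQUIRED_FIELDS
                 if name not in report]
    score = report.get("score")
    # Assumption: a score exactly at the threshold counts as a pass.
    if isinstance(score, (int, float)) and score < MIN_SCORE:
        failures.append(f"score {score} below threshold {MIN_SCORE}")
    return failures


# Example run against a temporary output directory.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    (out / "report.json").write_text(json.dumps({"title": "Q2", "score": 0.65}))
    result = programmatic_eval(out)
```

Returning reasons rather than a bare boolean keeps the check deterministic while still telling you which assertion failed.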

Keep the judge independent of the agent under test. A model grading its own output tends to rubber-stamp broken work. Judge ≠ subject.
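A rubric verdict and the independence rule can be sketched together. The scores here are hard-coded stand-ins for what a judge model would return, the threshold values and model names are hypothetical, and treating a score equal to its threshold as passing is an assumption; none of this is the simple_judge step's real interface.

```python
# Hypothetical per-dimension minimums on the 1-5 scale.
THRESHOLDS = {"accuracy": 4, "tone": 3, "completeness": 4}


def rubric_verdict(scores, judge_model, subject_model):
    """Turn judge scores into a pass/fail verdict, refusing self-grading."""
    if judge_model == subject_model:
        raise ValueError("judge must differ from the model under test")
    missing = sorted(set(THRESHOLDS) - set(scores))
    if missing:
        raise ValueError(f"judge did not score: {missing}")
    for dim in THRESHOLDS:
        if not 1 <= scores[dim] <= 5:
            raise ValueError(f"{dim} score {scores[dim]} is outside 1-5")
    # Assumption: a score equal to its threshold passes.
    failed = sorted(d for d, minimum in THRESHOLDS.items() if scores[d] < minimum)
    return {"passed": not failed, "failed_dimensions": failed}


verdict = rubric_verdict(
    {"accuracy": 5, "tone": 4, "completeness": 3},
    judge_model="judge-model",    # hypothetical identifiers
    subject_model="agent-model",
)
```

Here `verdict` reports a failure on completeness only, which is the kind of result a grading step could then write as a label.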

The hill-climbing loop

Because every run is a trajectory and every runbook carries its evals, you can measure quality, then improve it on purpose:

Run the runbook over a few cases. Each produces a trajectory with an eval result.
Read the trajectories where the eval failed. The trace shows exactly which step fell short.
Edit the runbook — tighten the prompt, reorder a step, fix a threshold.
Re-run and confirm the score climbed. Repeat.
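The measuring half of the steps above comes down to simple arithmetic over eval results. In this sketch the result records, case names, and step names are invented for illustration; in practice they would come from trajectories and their eval labels.

```python
from collections import Counter


def pass_rate(results):
    """Fraction of cases whose eval passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(1 for r in results if r["passed"]) / len(results)


def failing_steps(results):
    """Count which step fell short across the failed cases."""
    return Counter(r["failed_step"] for r in results if not r["passed"])


# Hypothetical eval results before and after editing the runbook.
before = [
    {"case": "a", "passed": True, "failed_step": None},
    {"case": "b", "passed": False, "failed_step": "summarize"},
    {"case": "c", "passed": False, "failed_step": "summarize"},
    {"case": "d", "passed": False, "failed_step": "extract"},
]
after = [
    {"case": "a", "passed": True, "failed_step": None},
    {"case": "b", "passed": True, "failed_step": None},
    {"case": "c", "passed": True, "failed_step": None},
    {"case": "d", "passed": False, "failed_step": "extract"},
]

worst_step = failing_steps(before).most_common(1)[0][0]  # where to edit first
improved = pass_rate(after) > pass_rate(before)          # did the score climb?
```

Tallying failures by step points the next edit at the step that fails most often, and comparing pass rates on the same cases confirms whether the edit helped.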

The /optimize-runbook skill automates this: it reads prior trajectories, finds the failure patterns, proposes targeted edits, and re-runs the evals to verify that the runbook actually improved. A runbook might go from a 62% pass rate to 84% over a few cycles. The loop also keeps a runbook current as models and rubrics shift, whereas a single run only tells you about one moment. Full walkthrough in Evaluating & optimizing.

Keeping evals fresh

Models and providers drift. A runbook that passed last month can quietly regress when a model updates. Schedule a routine to re-run your evals on a cadence, or gate merges on them in CI, so you catch a regression the day it happens instead of in production.
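One way to gate merges is to compare the current pass rate against a stored baseline and fail the build when it drops. The function name, the tolerance, and the idea of a stored baseline are assumptions made for this sketch, not Jetty features.

```python
def regression_gate(current, baseline, tolerance=0.02):
    """Return a process exit code: 0 if quality held, 1 if it regressed.

    `tolerance` is a hypothetical allowance for run-to-run noise, so a
    small dip does not block a merge.
    """
    for name, rate in (("current", current), ("baseline", baseline)):
        if not 0.0 <= rate <= 1.0:
            raise ValueError(f"{name} pass rate must be between 0 and 1")
    return 1 if current < baseline - tolerance else 0


# Hypothetical numbers: last month's baseline vs. today's scheduled re-run.
exit_code = regression_gate(current=0.80, baseline=0.84)
```

A CI script would pass this value to `sys.exit`, so a non-zero code fails the job on the day the regression appears.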

Next: add evals and hill-climb a runbook →