Bring your own framework

You don't have to run your agent on Jetty to get value from Jetty. The opposite also works: keep your agent wherever it already lives, and use Jetty as the independent evaluation, observability, and optimization layer around it.

There are two ways to combine an agent with Jetty, and they're worth keeping straight:

Run your agent on Jetty. Jetty executes the agent inside its own managed sandbox using one of the built-in runbook runtimes (claude-code, codex, gemini-cli, hermes).
Run your agent anywhere, evaluate on Jetty. Your framework runs the agent; Jetty runs a separate grader runbook, stores the trajectory, and the SDK orchestrates the comparison. That's this page.

The second pattern is what makes Jetty framework-agnostic: the grader is independent of the agent under test, so you can swap frameworks, models, or providers and keep the exact same measuring stick.

The worked example: Flue + Jetty

examples/flue-jetty in the SDK repo is a complete, runnable version of this pattern. Flue is an agent framework: you define an agent with createAgent() from @flue/runtime and drive it through a session (session.prompt(text)). The example A/B-tests two configurations of a help-desk triage agent (a warm variant against a terse one), grades every reply, and flags the variant that regressed.

The flow for each ticket:

Flue runs the agent. Each ticket is sent to both agent configs via session.prompt(...). The agent executes entirely on the Flue runtime, so Jetty never wraps or re-implements it.
Jetty grades the output. For each draft, the SDK calls runAndWait() on a separate grader runbook, passing the case as an uploaded file. The grade is read back with downloadFile() and the run is tagged with addLabel().
The SDK aggregates and judges. Per-config pass rate, average score, and average cost are computed client-side, and any config below the winner is marked as a regression.

The grader is deliberately independent

The thing being graded is the agent's output; the thing doing the grading is a Jetty runbook (grader/RUNBOOK.md) that the agent under test never sees. It runs a deterministic Python scorer across three dimensions (Addresses the issue, Tone, Completeness), each scored 1–5, and passes a reply only if the overall score is ≥ 4.0 and every dimension is ≥ 3.

// src/workflows/eval.ts — per ticket, per config
const draft = await session.prompt(ticket.body);      // Flue runs the agent

const run = await jetty.runAndWait(                   // Jetty grades it
  collection,
  "triage-grader",
  { vars: { case: ticket.id } },
  { pollMs: 4000, files: [asFile(draft)] }
);

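// Read the grade the runbook wrote and parse it as JSON
const grade = JSON.parse(
  new TextDecoder().decode((await jetty.downloadFile(run.gradeKey)).bytes)
);
// Tag the run with the config that produced the draft, so runs can be grouped per config
await jetty.addLabel(collection, "triage-grader", run.id, "config", config.name);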

Because the scorer is a runbook and not the agent grading itself, results are reproducible: re-running the eval — or pointing it at a different framework — measures against the same bar every time. (Letting a weak model grade its own output is a known way to rubber-stamp broken work; an independent grader avoids it.)
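To make the pass rule concrete, here is a minimal sketch of it in TypeScript. The real scorer is a Python script inside the grader runbook and is not reproduced here; the Grade shape and field names below are hypothetical, and the sketch assumes the 4.0 bar applies to the mean of the three dimensions (a plain sum of three 1–5 scores would always clear 4.0).

```typescript
// Hypothetical grade shape; the example's real schema may differ.
interface Grade {
  addressesIssue: number; // 1-5
  tone: number;           // 1-5
  completeness: number;   // 1-5
}

const PASS_BAR = 4.0;     // overall bar, taken from the text
const MIN_DIMENSION = 3;  // per-dimension floor, taken from the text

// Assumption: the overall score is the mean of the three dimensions.
function overallScore(g: Grade): number {
  return (g.addressesIssue + g.tone + g.completeness) / 3;
}

// A reply passes only if it clears the overall bar AND no dimension falls below the floor.
function passes(g: Grade): boolean {
  const dims = [g.addressesIssue, g.tone, g.completeness];
  return overallScore(g) >= PASS_BAR && dims.every((d) => d >= MIN_DIMENSION);
}
```

The per-dimension floor is what stops a reply with a perfect tone from passing when it never addresses the issue: a high average cannot hide one failing dimension.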

Regression detection

src/eval.ts sets a fixed bar (PASS_BAR = 4.0), aggregates each config's pass rate, average score, and average cost, ranks the configs, and flags every config that is not the winner and scored below it. The output is a verdict table (✅ keep for the winner, ❌ regressed for the rest), so a CI job can fail the build when a change drops quality.
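The aggregation described above might look roughly like the following. This is a sketch, not the contents of src/eval.ts: the GradedRun and ConfigSummary types are hypothetical, and ranking by average score with pass rate as a tie-breaker is an assumption, since the text does not say which metric decides the winner.

```typescript
// Hypothetical per-run record, one per (ticket, config) pair.
interface GradedRun {
  config: string;
  score: number;    // overall score from the grader
  passed: boolean;  // grader's pass/fail decision
  costUsd: number;  // cost of producing the draft
}

interface ConfigSummary {
  config: string;
  passRate: number;
  avgScore: number;
  avgCost: number;
  verdict: "keep" | "regressed";
}

function summarize(runs: GradedRun[]): ConfigSummary[] {
  // Group runs by config name.
  const byConfig = new Map<string, GradedRun[]>();
  for (const r of runs) {
    const bucket = byConfig.get(r.config) ?? [];
    bucket.push(r);
    byConfig.set(r.config, bucket);
  }

  const mean = (xs: number[]): number => xs.reduce((a, b) => a + b, 0) / xs.length;

  // Compute the three per-config aggregates named in the text.
  const rows = Array.from(byConfig.entries()).map(([config, rs]) => ({
    config,
    passRate: rs.filter((r) => r.passed).length / rs.length,
    avgScore: mean(rs.map((r) => r.score)),
    avgCost: mean(rs.map((r) => r.costUsd)),
  }));

  // Assumption: rank by average score, then by pass rate.
  rows.sort((a, b) => b.avgScore - a.avgScore || b.passRate - a.passRate);

  const winner = rows[0];
  if (!winner) return [];

  // Flag every config that is not the winner and scored below it.
  return rows.map((row): ConfigSummary => ({
    ...row,
    verdict: row !== winner && row.avgScore < winner.avgScore ? "regressed" : "keep",
  }));
}
```

A CI script could then check whether any row has the verdict "regressed" and exit non-zero if so.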

Try it

The example has an offline mode that needs no API keys, which makes it a good first run:

# Offline: deterministic stand-in, no keys
npm install && npm run build -w @jetty/sdk
cd examples/flue-jetty && npm run demo

# Live A/B over Flue + Jetty
cp .env.example .env              # then fill in your keys in .env
set -a && . ./.env && set +a      # export every variable defined in .env
npm run deploy-grader
npx flue run eval --target node --payload '{"tickets":2}'

Using your own framework instead of Flue

The pattern doesn't depend on Flue. Anywhere you can:

produce an agent output (a string, a file, a JSON blob), and
call jetty.runAndWait(collection, graderTask, …) on it,

…you can drop Jetty in as the eval layer — LangChain, a raw provider SDK, or a hand-rolled loop. Write the grader once as a runbook, and every framework you try is measured the same way. The only Jetty-specific code is the SDK calls.
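One way to keep the framework swappable is to hide both sides behind small function types. The sketch below is illustrative only: Agent, Grader, Ticket, and evaluate are hypothetical names, and the Grader type stands in for whatever wrapper you write around the SDK's runAndWait() and downloadFile() calls rather than reproducing their signatures.

```typescript
// Any framework fits, as long as it can turn an input into an output string.
type Agent = (input: string) => Promise<string>;

// Hypothetical wrapper around the grading step (e.g. your own function that
// calls the Jetty SDK and returns the parsed grade).
type Grader = (caseId: string, output: string) => Promise<{ score: number; passed: boolean }>;

interface Ticket {
  id: string;
  body: string;
}

interface EvalResult {
  ticket: string;
  score: number;
  passed: boolean;
}

// Run every ticket through the agent, then hand each draft to the grader.
async function evaluate(agent: Agent, grade: Grader, tickets: Ticket[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const t of tickets) {
    const draft = await agent(t.body);     // your framework runs the agent
    const g = await grade(t.id, draft);    // the independent grader scores it
    results.push({ ticket: t.id, score: g.score, passed: g.passed });
  }
  return results;
}
```

Swapping frameworks then means passing a different Agent function; the grader and the loop stay unchanged.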

Prefer to run the agent inside Jetty's sandbox instead? See the four built-in runbook runtimes →