Flue + Jetty
You shipped a Flue agent. You tweaked a prompt. Is it better or worse? You can't tell from one reply. That's the run, check, fix, rerun loop from How to build an AI agent, and Jetty is the check.
Flue owns the agent loop. Jetty grades each output and keeps it: every run becomes a trajectory you can score, label, and diff across versions, so a regression shows up before a customer finds it. The grader is a Jetty runbook the agent under test never sees, so a model can't grade its own output and rubber-stamp a regression.
The example is examples/flue-jetty in the SDK repo. It A/B-tests a help-desk triage agent in two configs (warm and terse), grading every reply and flagging the config that regressed:
TICKETS: 2 GRADER: rubric (independent)
config pass avg $/run
v1 (warm) 2/2 4.7 0.0093 ✅
v2 (terse) 0/2 2.7 0.0032 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
Install
There's no flue add tooling jetty package. Jetty plugs in through its SDK:
npm install @jetty/sdk
Requires @jetty/sdk 0.2.0+ (for gradeWithJetty).
The grading workflow
In a Flue workflow, draft with the agent, then hand the draft to a Jetty grading task and wait for the trajectory. The grade comes back as a row you can label and compare:
import { defineWorkflow } from "@flue/runtime";
import * as v from "valibot";
import { JettyClient, gradeWithJetty } from "@jetty/sdk";
import { triageAgent } from "../agent.js";
const jetty = new JettyClient(); // JETTY_API_TOKEN from env or ~/.config/jetty/token
export default defineWorkflow({
agent: triageAgent,
input: v.object({ ticket: v.any() }),
async run({ harness, input }) {
// 1. Flue runs the agent (it owns the loop).
const session = await harness.session();
const draft = await session.prompt(JSON.stringify(input.ticket));
// 2. Jetty grades it server-side, with a grader that isn't the author —
// upload, run the grader, read the grade, and label, in one call.
const { grade, trajectoryId } = await gradeWithJetty(jetty, "acme", "triage-grader", {
files: [{ filename: "case.json", data: draft.text }],
useTrialKeys: true, // grade on Jetty's free trial, no provider key
labels: (g) => ({ "eval.grade": String(g.total) }), // labels can read the grade
});
return { grade, gradeTrajectoryId: trajectoryId };
},
});
Each grade is a Jetty trajectory: the inputs, outputs, score, and cost, ready to replay. Compare the eval.* labels across configs to see which version slipped.
Configure
| Variable | Required | Purpose |
|---|---|---|
| JETTY_API_TOKEN | yes | Jetty API token (also read from ~/.config/jetty/token). |
| JETTY_COLLECTION | yes | Collection that owns the grading task. |
| JETTY_GRADE_TASK | yes | The grading runbook (e.g. triage-grader). |
| JETTY_USE_TRIAL_KEYS | no | Grade on Jetty's free trial, no provider key (see below). |
| ANTHROPIC_API_KEY | for the agent | The Flue agent runs on your machine, so it needs a model key. |
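A missing variable tends to surface late, in the middle of a run. A minimal sketch of a fail-fast check follows. It assumes a hypothetical helper named missingEnv (not part of @jetty/sdk) and takes the three required names from the table above. Because the token can also come from ~/.config/jetty/token, treating JETTY_API_TOKEN as mandatory in the environment is an assumption of this sketch.

```typescript
// Hypothetical helper, not part of @jetty/sdk: report which required
// variables are absent or blank in an environment-like record.
const REQUIRED_VARS = ["JETTY_API_TOKEN", "JETTY_COLLECTION", "JETTY_GRADE_TASK"];

function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => {
    const value = env[name];
    // Assumption: a blank value counts as missing.
    return value === undefined || value.trim() === "";
  });
}

// Example with a literal record; a real script would pass process.env.
const missing = missingEnv({ JETTY_API_TOKEN: "mlc_example", JETTY_COLLECTION: "" });
```

Here missing holds the names that still need a value, so a script could print them and exit before any agent or grader call is made.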
Credentials. Put anything sensitive in secretParams, which the server keeps out of the stored trajectory. Don't put secrets in initParams; that field is persisted. The SDK never logs your token. Tokens resolve from a constructor arg, then JETTY_API_TOKEN, then ~/.config/jetty/token.
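The resolution order above can be pictured as a first-match lookup. This is a sketch of the documented precedence only, not the SDK's actual implementation. The resolveToken function and the TokenSources shape are hypothetical, and skipping blank values is an assumption.

```typescript
// Hypothetical sketch of the documented precedence. Sources are passed in
// explicitly, so nothing here reads the environment or the filesystem.
interface TokenSources {
  constructorArg?: string; // token passed to the client constructor
  envToken?: string;       // value of JETTY_API_TOKEN
  fileToken?: string;      // contents of ~/.config/jetty/token
}

function resolveToken(sources: TokenSources): string {
  const candidates = [sources.constructorArg, sources.envToken, sources.fileToken];
  for (const candidate of candidates) {
    // Assumption: blank or whitespace-only values are skipped.
    const trimmed = candidate === undefined ? "" : candidate.trim();
    if (trimmed !== "") return trimmed;
  }
  throw new Error("No Jetty API token found");
}

// The environment variable wins over the file when both are present.
const resolved = resolveToken({ envToken: "mlc_env", fileToken: "mlc_file\n" });
```

Passing the sources in as plain values keeps the precedence easy to test without touching a real token.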
What Jetty captures
| Flue | Jetty |
|---|---|
| Agent output (the draft) | The input the grading runbook scores |
| Grade (1–5) | Label eval.grade on the trajectory |
| Pass / fail vs. the bar | Label eval.pass |
| Per-run cost (response.usage) | Label eval.cost_usd |
| Which agent config / version | Label eval.config |
| The whole graded run | A trajectory: inputs, outputs, steps, replayable |
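To make the mapping in the table concrete, here is a small sketch that turns one graded run into the four eval.* labels. Only the label keys come from the table. The GradedRun shape, the evalLabels helper, the string formats, and the "at or above the bar" pass rule are assumptions for illustration.

```typescript
// Hypothetical shapes: only the label keys are taken from the table above.
interface GradedRun {
  config: string;  // e.g. "v2 (terse)"
  grade: number;   // rubric score on the 1 to 5 scale
  costUsd: number; // per-run cost in US dollars
}

const DEFAULT_BAR = 4.0; // the bar shown in the example output

function evalLabels(run: GradedRun, bar: number = DEFAULT_BAR): Record<string, string> {
  return {
    "eval.config": run.config,
    "eval.grade": String(run.grade),
    // Assumption: a grade equal to the bar counts as a pass.
    "eval.pass": String(run.grade >= bar),
    "eval.cost_usd": run.costUsd.toFixed(4),
  };
}

const terseLabels = evalLabels({ config: "v2 (terse)", grade: 2.7, costUsd: 0.0032 });
```

Keeping every value a string matches the labels callback in the workflow above, which stringifies the grade before attaching it.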
Full walkthrough: catch a regression
A step-by-step run of the example: first offline (no keys, ~10 seconds), then live. Whether you need a key depends on which of three levels you run:
- Offline demo (npm run demo): no keys at all.
- Grading on Jetty: covered by the free trial (10 runs, auto-activated). No API key.
- The live Flue agent: runs on your machine via Flue, so it uses your Anthropic key.
1. Clone and build
Run the example bundled in the repo. The -w @jetty/sdk flags refer to this monorepo's workspaces, so clone it and run everything from inside the checkout. (To use the SDK in your own app instead, jump to Use the SDK in your own project.)
git clone https://github.com/jettyio/jetty-sdk.git
cd jetty-sdk
npm install # installs the SDK, the example, and Flue (one workspace install)
npm run build -w @jetty/sdk # the example imports the built SDK
cd examples/flue-jetty
2. Run the offline demo (no keys)
npm run demo
You should see the verdict table immediately:
Acme Helpdesk — did my last change to the triage agent make it worse?
(simulated; run `npm run eval` for the real thing)
TICKETS: 5 GRADER: rubric (independent)
config pass avg $/run
------------ ----- ---- -------
v1 (warm) 5/5 4.5 0.0051 ✅
v2 (terse) 1/5 3.5 0.0039 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
It also writes report.html and opens it: a styled verdict and per-run breakdown, the same report the live run produces. This is a deterministic stand-in, no spend. If you only want to understand the example, you can stop here.
3. Configure credentials (for the live run)
cp .env.example .env
Edit .env:
ANTHROPIC_API_KEY=sk-ant-... # your Anthropic key
JETTY_API_TOKEN=mlc_... # your Jetty token
JETTY_COLLECTION=your-collection # a collection your token can write to
JETTY_GRADE_TASK=triage-grader # leave as-is
Load it into your shell (the scripts read process.env; they don't auto-load .env):
set -a && . ./.env && set +a
4. Deploy the grader (one time)
The workflow calls a Jetty runbook that scores each draft. Deploy it into your collection:
npm run deploy-grader
[env] pushed: ANTHROPIC_API_KEY
[task] created your-collection/triage-grader
✓ deployed: your-collection/triage-grader
This pushes your ANTHROPIC_API_KEY into the collection (so the grader's sandbox can run) and creates the triage-grader task from grader/RUNBOOK.md. Re-running it updates the task.
5. Run the live A/B
npx flue run eval --target node --input '{"tickets":2}'
Each ticket is a real server-side grade (a sandbox run), so start with tickets:2 (a few minutes) before bumping to the full 5. You'll see a line per run, then the verdict table:
v1 (warm) · reset: 4.7 PASS
v1 (warm) · double-charge: 4.7 PASS
v2 (terse) · reset: 2.7 fail
v2 (terse) · double-charge: 2.7 fail
TICKETS: 2 GRADER: rubric (independent)
config pass avg $/run
v1 (warm) 2/2 4.7 0.0093 ✅
v2 (terse) 0/2 2.7 0.0032 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
The run then writes report.html and opens it: the verdict, a per-run breakdown, and links to each Jetty trajectory. The eval caught that the terse config regressed, before it ever reached a customer.
Run it on Jetty's free trial (no key)
Jetty gives every collection a free trial (10 runs, auto-activated, on Jetty's keys), so you can run the grading with no Anthropic key and no key-push:
# deploy without pushing a key (the trial covers the grader)
unset ANTHROPIC_API_KEY
npm run deploy-grader
# grade on the trial
JETTY_USE_TRIAL_KEYS=true npx flue run eval --target node --input '{"tickets":2}'
The trial covers server-side Jetty runs (the grader). The Flue agent runs on your machine, so the full live run still needs your Anthropic key for the agent. But you can see Jetty's grading and trajectories on the trial with zero keys, and npm run demo needs no keys at all. (Sonnet and most models are covered; Opus-class is excluded. After 10 runs, add your own key in Settings.)
6. Inspect what got stored
Every grade is a Jetty trajectory, labelled with eval.config, eval.grade, eval.pass, and eval.cost_usd. View them in the Jetty UI (https://flows.jetty.io/<collection>/triage-grader) or from code:
import { JettyClient } from "@jetty/sdk";
const jetty = new JettyClient(); // reads JETTY_API_TOKEN
const list = await jetty.listTrajectories(process.env.JETTY_COLLECTION!, "triage-grader", 5);
for (const t of list.trajectories) {
const full = await jetty.getTrajectory(process.env.JETTY_COLLECTION!, "triage-grader", t.trajectory_id);
const labels = Object.fromEntries(full.labels.map((l) => [l.key, l.value]));
console.log(t.trajectory_id, labels["eval.config"], labels["eval.grade"], labels["eval.pass"]);
}
Because the runs are durable and labelled, you can compare configs across releases, long after the terminal session that produced them.
How it works (the pieces)
| File | Role |
|---|---|
| src/tickets.ts | The eval cases + the two configs (v1 warm, v2 terse). |
| src/agent.ts | The Flue triage agent; the config's style is injected per prompt. |
| src/workflows/eval.ts | The live loop: for each config × ticket → Flue draft → grade + label → collect. |
| src/eval.ts | aggregate() (per-config pass-rate/score/cost) + renderVerdict() (the table). |
| grader/RUNBOOK.md | The independent grader: a deterministic Python rubric. |
| src/deploy-grader.ts | Deploys the grader via the SDK (createTask + setEnvironmentVars). |
The SDK does the orchestration: runWithFiles/runAndWait (with file upload), getTrajectory, downloadFile, addLabel, createTask. That's the part worth copying into your own eval.
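The per-config roll-up described for src/eval.ts can be sketched without any SDK calls. The version below is illustrative and may differ from the real aggregate(). The names aggregateRuns, RunResult, and ConfigSummary are hypothetical, and flagging a config as regressed when its average grade falls below the bar is an assumption drawn from the example output.

```typescript
// Illustrative only: the real aggregate() in src/eval.ts may differ.
interface RunResult { config: string; grade: number; costUsd: number; }
interface ConfigSummary {
  config: string;
  passed: number;
  total: number;
  avgGrade: number;
  avgCostUsd: number;
  regressed: boolean;
}

function aggregateRuns(runs: RunResult[], bar: number = 4.0): ConfigSummary[] {
  // Group runs by config, preserving first-seen order.
  const groups: Record<string, RunResult[]> = {};
  for (const run of runs) {
    if (groups[run.config] === undefined) groups[run.config] = [];
    groups[run.config].push(run);
  }
  return Object.keys(groups).map((config) => {
    const group = groups[config];
    const total = group.length;
    const passed = group.filter((r) => r.grade >= bar).length;
    const avgGrade = group.reduce((sum, r) => sum + r.grade, 0) / total;
    const avgCostUsd = group.reduce((sum, r) => sum + r.costUsd, 0) / total;
    // Assumption: a config regresses when its average grade is below the bar.
    return { config, passed, total, avgGrade, avgCostUsd, regressed: avgGrade < bar };
  });
}

// The four runs from the live A/B output above.
const summaries = aggregateRuns([
  { config: "v1 (warm)", grade: 4.7, costUsd: 0.0093 },
  { config: "v1 (warm)", grade: 4.7, costUsd: 0.0093 },
  { config: "v2 (terse)", grade: 2.7, costUsd: 0.0032 },
  { config: "v2 (terse)", grade: 2.7, costUsd: 0.0032 },
]);
```

Each summary carries the pass count, the averages, and the regressed flag, which is roughly what a verdict table needs per row.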
Use the SDK in your own project
Everything above runs the example inside this repo. To use the SDK in a new, standalone project, you don't need this repo or any workspaces. Install the published package from npm:
mkdir my-app && cd my-app
npm init -y
npm pkg set type=module # the SDK is ESM
npm install @jetty/sdk
// index.js
import { JettyClient } from "@jetty/sdk";
const jetty = new JettyClient(); // reads JETTY_API_TOKEN (or ~/.config/jetty/token)
console.log((await jetty.listCollections()).map((c) => c.name));
There is no -w @jetty/sdk here. That flag only applies inside the monorepo. From there, copy the orchestration pattern from src/workflows/eval.ts into your own code. The pattern isn't Flue-specific: anywhere you can produce an agent output and call jetty.runAndWait(...) on it (LangChain, a raw provider SDK, a hand-rolled loop), Jetty drops in as the eval layer.
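The framework-agnostic shape of that pattern can be sketched with the agent and the grader passed in as plain async functions. Everything here is hypothetical: runEval, the Draft and Grade types, and the stub implementations are stand-ins, not SDK or Flue APIs. In a real project, the draft function would call your agent and the grade function would wrap a Jetty grading call.

```typescript
// Hypothetical sketch: draft and grade are stand-ins supplied by the caller.
interface Ticket { id: string; body: string; }
interface EvalRow { config: string; ticketId: string; grade: number; pass: boolean; }
type Draft = (config: string, ticket: Ticket) => Promise<string>;
type Grade = (draftText: string) => Promise<number>;

async function runEval(
  configs: string[],
  tickets: Ticket[],
  draft: Draft,
  grade: Grade,
  bar: number = 4.0,
): Promise<EvalRow[]> {
  const rows: EvalRow[] = [];
  for (const config of configs) {
    for (const ticket of tickets) {
      const text = await draft(config, ticket); // the agent under test
      const score = await grade(text);          // the independent grader
      rows.push({ config, ticketId: ticket.id, grade: score, pass: score >= bar });
    }
  }
  return rows;
}

// Stubs so the sketch runs with no network access or keys.
const stubDraft: Draft = async (config, ticket) => config + ": reply to " + ticket.body;
const stubGrade: Grade = async (text) => (text.indexOf("warm") === 0 ? 4.7 : 2.7);
```

Keeping the grader as a separate function mirrors the design point made earlier: the code that writes the draft never scores it.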
Make it yours
- Add cases: append to TICKETS in src/tickets.ts.
- Compare your own versions: edit the two entries in CONFIGS, or change FLUE_MODEL to A/B models.
- Move the bar: change PASS_BAR in src/eval.ts.
- Swap the grader: the rubric in grader/RUNBOOK.md is plain Python. Replace it with an LLM-judge call for model-based grading, then run npm run deploy-grader.
Protect sensitive content
Trajectories persist step inputs and outputs, so they're content-bearing. Put credentials in secretParams (kept out of the stored trajectory), not initParams. If a draft can carry PII, redact it before grading, or grade a hash or summary instead. Treat trajectory storage like any other logging surface.
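A redaction pass before grading might look like the sketch below. The redactDraft helper and both patterns are hypothetical and deliberately simple. They catch plain email addresses and 13 to 16 digit card-like numbers only, and will miss many other PII formats, so treat this as a starting point and not a guarantee.

```typescript
// Hypothetical redaction pass, not part of the example repo or the SDK.
// Matches simple email addresses.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// Matches 13 to 16 digits, optionally separated by single spaces or hyphens.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,15}\b/g;

function redactDraft(text: string): string {
  return text.replace(EMAIL_PATTERN, "[email]").replace(CARD_PATTERN, "[card]");
}

const safeDraft = redactDraft("Refund sent to ana@example.com for card 4111 1111 1111 1111.");
// safeDraft is "Refund sent to [email] for card [card]."
```

The redacted text is what you would upload for grading, so the stored trajectory never contains the original values.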
Troubleshooting
- "No workspaces found: --workspace=@jetty/sdk": you're not inside the jetty-sdk checkout. The -w flag is monorepo-only: clone the repo and run from its root (Step 1), or use the SDK in your own project (above).
- "grader produced no /app/results files": the grader must (a) run on a model the claude-code runtime supports (use claude-sonnet-4-6, not haiku), (b) keep the secrets: ANTHROPIC_API_KEY block in its frontmatter so the key reaches the sandbox, and (c) write to /app/results/. All three are already set in grader/RUNBOOK.md.
- "No Jetty API token found": you didn't load .env; run set -a && . ./.env && set +a, or export JETTY_API_TOKEN.
- "Cannot use import statement outside a module": flue run needs "type": "module" (already set here) and flue.config.ts at the example root.
- The live run is slow: each grade spins up a sandbox (a few minutes for 2 tickets). That's expected; the offline demo (npm run demo) is the fast path.
Note on scope: Jetty has no external trajectory-ingestion endpoint yet, so grading runs through a Jetty task (which is what creates the trajectory) rather than pushing an externally-produced trace.