Flue + Jetty

You shipped a Flue agent, then tweaked a prompt. Is it better or worse? You can't tell from one reply. That's the run, check, fix, rerun loop from "How to build an AI agent", and Jetty is the check.

Flue owns the agent loop. Jetty grades each output and keeps it: every run becomes a trajectory you can score, label, and diff across versions, so a regression shows up before a customer finds it. The grader is a Jetty runbook the agent under test never sees, so a model can't grade its own output and rubber-stamp a regression.

The example is examples/flue-jetty in the SDK repo. It A/B-tests a help-desk triage agent in two configs (warm and terse), grading every reply and flagging the config that regressed:

TICKETS: 2   GRADER: rubric (independent)
 config        pass   avg   $/run
 v1 (warm)    2/2    4.7   0.0093  ✅
 v2 (terse)   0/2    2.7   0.0032  ❌  regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

Install

You can't add Jetty with flue add tooling jetty; there is no such package. Jetty plugs in through its SDK:

npm install @jetty/sdk

Requires @jetty/sdk 0.2.0+ (for gradeWithJetty).

The grading workflow

In a Flue workflow, draft with the agent, then hand the draft to a Jetty grading task and wait for the trajectory. The grade comes back as a row you can label and compare:

import { defineWorkflow } from "@flue/runtime";
import * as v from "valibot";
import { JettyClient, gradeWithJetty } from "@jetty/sdk";
import { triageAgent } from "../agent.js";

const jetty = new JettyClient(); // JETTY_API_TOKEN from env or ~/.config/jetty/token

export default defineWorkflow({
  agent: triageAgent,
  input: v.object({ ticket: v.any() }),
  async run({ harness, input }) {
    // 1. Flue runs the agent (it owns the loop).
    const session = await harness.session();
    const draft = await session.prompt(JSON.stringify(input.ticket));

    // 2. Jetty grades it server-side, with a grader that isn't the author —
    //    upload, run the grader, read the grade, and label, in one call.
    const { grade, trajectoryId } = await gradeWithJetty(jetty, "acme", "triage-grader", {
      files: [{ filename: "case.json", data: draft.text }],
      useTrialKeys: true,                          // grade on Jetty's free trial, no provider key
      labels: (g) => ({ "eval.grade": String(g.total) }), // labels can read the grade
    });

    return { grade, gradeTrajectoryId: trajectoryId };
  },
});

Each grade is a Jetty trajectory: the inputs, outputs, score, and cost, ready to replay. Compare the eval.* labels across configs to see which version slipped.
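
Labels are stored as strings (the workflow above writes String(g.total)), so comparing configs starts with parsing them back into typed values. The following is a minimal sketch, not part of @jetty/sdk: parseEvalLabels, Label, and EvalRow are hypothetical names, and it assumes labels arrive as { key, value } pairs and that eval.pass is stored as "true" or "false".

```typescript
// Hypothetical helper (not an SDK function): turn eval.* labels into a typed row.
// Assumption: labels are { key, value } string pairs; eval.pass is "true"/"false".
type Label = { key: string; value: string };

interface EvalRow {
  config: string;
  grade: number;
  pass: boolean;
  costUsd: number;
}

function parseEvalLabels(labels: Label[]): EvalRow | null {
  // Flatten the label list into a key -> value lookup.
  const byKey: Record<string, string | undefined> = {};
  for (const l of labels) byKey[l.key] = l.value;

  const config = byKey["eval.config"];
  const grade = Number(byKey["eval.grade"]);
  // A trajectory without a config or a numeric grade is not an eval row.
  if (config === undefined || isNaN(grade)) return null;

  return {
    config,
    grade,
    pass: byKey["eval.pass"] === "true",
    costUsd: Number(byKey["eval.cost_usd"] ?? "0"),
  };
}
```

Returning null for unlabelled trajectories lets a caller filter them out before comparing configs.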

Configure

Variable	Required	Purpose
`JETTY_API_TOKEN`	yes	Jetty API token (also read from `~/.config/jetty/token`).
`JETTY_COLLECTION`	yes	Collection that owns the grading task.
`JETTY_GRADE_TASK`	yes	The grading runbook (e.g. `triage-grader`).
`JETTY_USE_TRIAL_KEYS`	no	Grade on Jetty's free trial, no provider key (see below).
`ANTHROPIC_API_KEY`	for the agent	The Flue agent runs on your machine, so it needs a model key.

Credentials. Put anything sensitive in secretParams, which the server keeps out of the stored trajectory. Don't put secrets in initParams; that field is persisted. The SDK never logs your token. Tokens resolve from a constructor arg, then JETTY_API_TOKEN, then ~/.config/jetty/token.
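
The lookup order above can be pictured as a short fallback chain. This is a sketch of the described behaviour, not the SDK's actual code: resolveToken and TokenSources are hypothetical names, and readTokenFile stands in for whatever reads ~/.config/jetty/token.

```typescript
// Sketch of the documented order: constructor arg, then JETTY_API_TOKEN, then the token file.
// NOT the SDK's implementation. `readTokenFile` is a hypothetical callback that returns
// the contents of ~/.config/jetty/token, or undefined when the file does not exist.
interface TokenSources {
  explicit?: string;
  env: Record<string, string | undefined>;
  readTokenFile: () => string | undefined;
}

function resolveToken({ explicit, env, readTokenFile }: TokenSources): string {
  // Thunks keep the lookup lazy: the file is only read if the earlier sources are empty.
  const candidates: Array<() => string | undefined> = [
    () => explicit,
    () => env["JETTY_API_TOKEN"],
    readTokenFile,
  ];
  for (const next of candidates) {
    const token = next()?.trim();
    if (token) return token;
  }
  throw new Error("No Jetty API token found");
}
```

The first non-empty source wins, which is why an explicit constructor argument overrides whatever is in the environment.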

What Jetty captures

Flue	Jetty
Agent output (the draft)	The input the grading runbook scores
Grade (1–5)	Label `eval.grade` on the trajectory
Pass / fail vs. the bar	Label `eval.pass`
Per-run cost (`response.usage`)	Label `eval.cost_usd`
Which agent config / version	Label `eval.config`
The whole graded run	A trajectory: inputs, outputs, steps, replayable
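
To make the mapping concrete, here is a minimal sketch of building that label set from one graded run. buildEvalLabels and GradedRun are hypothetical names, and two details are assumptions rather than documented behaviour: a grade passes when it is at or above the bar, and cost is written with four decimal places.

```typescript
// Hypothetical helper: build the eval.* labels listed in the table for one graded run.
// Assumptions: pass means grade >= bar; cost is formatted to 4 decimals.
interface GradedRun {
  config: string;
  grade: number;
  costUsd: number;
}

function buildEvalLabels(run: GradedRun, passBar: number = 4.0): Record<string, string> {
  return {
    "eval.config": run.config,
    "eval.grade": String(run.grade),
    "eval.pass": String(run.grade >= passBar),
    "eval.cost_usd": run.costUsd.toFixed(4),
  };
}
```

Label values are all strings here, matching the String(g.total) call in the workflow.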

Full walkthrough: catch a regression

A step-by-step run of the example: first offline (no keys, ~10 seconds), then live. There are three levels of "do I need a key?":

Offline demo (npm run demo): no keys at all.
Grading on Jetty: covered by the free trial (10 runs, auto-activated). No API key.
Live Flue agent: runs on your machine, so it uses your Anthropic key.

1. Clone and build

Run the example bundled in the repo. The -w @jetty/sdk flag refers to a workspace in this monorepo, so clone it and run everything from inside the checkout. (To use the SDK in your own app instead, jump to Use the SDK in your own project.)

git clone https://github.com/jettyio/jetty-sdk.git
cd jetty-sdk
npm install                    # installs the SDK, the example, and Flue (one workspace install)
npm run build -w @jetty/sdk    # the example imports the built SDK
cd examples/flue-jetty

2. Run the offline demo (no keys)

npm run demo

You should see the verdict table immediately:

Acme Helpdesk — did my last change to the triage agent make it worse?
(simulated; run `npm run eval` for the real thing)

TICKETS: 5   GRADER: rubric (independent)

 config        pass   avg   $/run
 ------------  -----  ----  -------
 v1 (warm)    5/5    4.5   0.0051  ✅
 v2 (terse)   1/5    3.5   0.0039  ❌  regressed

→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

It also writes report.html and opens it: a styled verdict and per-run breakdown, the same report the live run produces. The demo is a deterministic stand-in and spends nothing. If you only want to understand the example, you can stop here.

3. Configure credentials (for the live run)

cp .env.example .env

Edit .env:

ANTHROPIC_API_KEY=sk-ant-...          # your Anthropic key
JETTY_API_TOKEN=mlc_...               # your Jetty token
JETTY_COLLECTION=your-collection      # a collection your token can write to
JETTY_GRADE_TASK=triage-grader        # leave as-is

Load it into your shell (the scripts read process.env; they don't auto-load .env):

set -a && . ./.env && set +a
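
Because nothing loads .env for you, a forgotten export only shows up later as a failed run. A small pre-flight check can catch it earlier. This is a hypothetical sketch, not part of the example: missingEnv is an illustrative name, and the variable list is taken from the Configure table.

```typescript
// Hypothetical pre-flight check; variable names come from the Configure table.
// Note: JETTY_API_TOKEN can also live in ~/.config/jetty/token, so treat a miss
// for that one as a warning rather than a hard failure.
const REQUIRED_ENV = ["ANTHROPIC_API_KEY", "JETTY_API_TOKEN", "JETTY_COLLECTION", "JETTY_GRADE_TASK"];

function missingEnv(
  env: Record<string, string | undefined>,
  required: string[] = REQUIRED_ENV,
): string[] {
  // A variable counts as missing when it is unset, empty, or whitespace only.
  return required.filter((name) => !env[name]?.trim());
}
```

In a script you would call missingEnv(process.env) first and stop with a message naming the missing variables if the result is not empty.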

4. Deploy the grader (one time)

The workflow calls a Jetty runbook that scores each draft. Deploy it into your collection:

npm run deploy-grader

[env] pushed: ANTHROPIC_API_KEY
[task] created your-collection/triage-grader
✓ deployed: your-collection/triage-grader

This pushes your ANTHROPIC_API_KEY into the collection (so the grader's sandbox can run) and creates the triage-grader task from grader/RUNBOOK.md. Re-running it updates the task.

5. Run the live A/B

npx flue run eval --target node --input '{"tickets":2}'

Every config × ticket pair is a real server-side grade (a sandbox run), so start with 2 tickets (a few minutes) before bumping to the full 5. You'll see a line per run, then the verdict table:

  v1 (warm) · reset: 4.7 PASS
  v1 (warm) · double-charge: 4.7 PASS
  v2 (terse) · reset: 2.7 fail
  v2 (terse) · double-charge: 2.7 fail

TICKETS: 2   GRADER: rubric (independent)
 config        pass   avg   $/run
 v1 (warm)    2/2    4.7   0.0093  ✅
 v2 (terse)   0/2    2.7   0.0032  ❌  regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

It then writes report.html and opens it: the verdict, a per-run breakdown, and links to each Jetty trajectory. The eval caught that the terse config regressed, before it ever reached a customer.

Run it on Jetty's free trial (no key)

Jetty gives every collection a free trial (10 runs, auto-activated, on Jetty's keys), so you can run the grading with no Anthropic key and no key-push:

# deploy without pushing a key (the trial covers the grader)
unset ANTHROPIC_API_KEY
npm run deploy-grader

# grade on the trial
JETTY_USE_TRIAL_KEYS=true npx flue run eval --target node --input '{"tickets":2}'

The trial covers server-side Jetty runs (the grader). The Flue agent runs on your machine, so the full live run still needs your Anthropic key for the agent; if you unset it for the deploy step, export it again before running the eval. Jetty's grading and trajectories themselves need no provider key on the trial, and npm run demo needs no keys at all. (Sonnet and most models are covered; Opus-class is excluded. After 10 runs, add your own key in Settings.)
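
The 10-run trial goes quickly in an A/B eval, since every config × ticket pair is graded. The arithmetic below is a sketch under one assumption that this guide does not confirm: each grade consumes exactly one trial run. trialBudget is a hypothetical helper.

```typescript
// Assumption: one graded config x ticket pair consumes one trial run.
const TRIAL_RUNS = 10; // trial size stated in the text above

function trialBudget(configs: number, tickets: number, alreadyUsed: number = 0) {
  const needed = configs * tickets;           // one grade per config x ticket pair
  const remaining = TRIAL_RUNS - alreadyUsed; // runs left on the trial
  return { needed, remaining, fits: needed <= remaining };
}
```

Under that assumption, 2 configs × 2 tickets needs 4 runs, and the full 2 × 5 set needs all 10, so a 2-ticket warm-up followed by the full set would not fit in one trial.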

6. Inspect what got stored

Every grade is a Jetty trajectory, labelled with eval.config, eval.grade, eval.pass, and eval.cost_usd. View them in the Jetty UI (https://flows.jetty.io/<collection>/triage-grader) or from code:

import { JettyClient } from "@jetty/sdk";

const jetty = new JettyClient(); // reads JETTY_API_TOKEN
const collection = process.env.JETTY_COLLECTION!;

const list = await jetty.listTrajectories(collection, "triage-grader", 5);
for (const t of list.trajectories) {
  // The list entry carries the id; fetch the full trajectory to read its labels.
  const full = await jetty.getTrajectory(collection, "triage-grader", t.trajectory_id);
  // Flatten the label list into a key -> value lookup.
  const labels = Object.fromEntries(full.labels.map((l) => [l.key, l.value]));
  console.log(t.trajectory_id, labels["eval.config"], labels["eval.grade"], labels["eval.pass"]);
}

Because the runs are durable and labelled, you can compare configs across releases, long after the terminal session that produced them has ended.

How it works (the pieces)

File	Role
`src/tickets.ts`	The eval cases + the two configs (`v1` warm, `v2` terse).
`src/agent.ts`	The Flue triage agent; the config's style is injected per prompt.
`src/workflows/eval.ts`	The live loop: for each config × ticket → Flue draft → grade + label → collect.
`src/eval.ts`	`aggregate()` (per-config pass-rate/score/cost) + `renderVerdict()` (the table).
`grader/RUNBOOK.md`	The independent grader: a deterministic Python rubric.
`src/deploy-grader.ts`	Deploys the grader via the SDK (`createTask` + `setEnvironmentVars`).

The SDK does the orchestration: runWithFiles/runAndWait (with file upload), getTrajectory, downloadFile, addLabel, createTask. That's the part worth copying into your own eval.
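
For orientation, here is a minimal sketch of what a per-config aggregation like aggregate() in src/eval.ts might look like. It is not the example's actual source: the Run and Summary shapes are hypothetical, and it assumes a run passes when its grade is at or above the bar and that a config fails the bar when its average falls below it.

```typescript
// Sketch only, not the code in src/eval.ts.
// Assumptions: a run passes when grade >= bar; a config is ok when its average >= bar.
interface Run {
  config: string;
  grade: number;
  costUsd: number;
}

interface Summary {
  config: string;
  passed: number;
  total: number;
  avg: number;
  costPerRun: number;
  ok: boolean;
}

const PASS_BAR = 4.0;

function aggregate(runs: Run[], bar: number = PASS_BAR): Summary[] {
  // Group runs by config, preserving first-seen order.
  const groups: Record<string, Run[]> = {};
  for (const r of runs) (groups[r.config] = groups[r.config] ?? []).push(r);

  return Object.keys(groups).map((config) => {
    const rs = groups[config];
    const avg = rs.reduce((sum, r) => sum + r.grade, 0) / rs.length;
    return {
      config,
      passed: rs.filter((r) => r.grade >= bar).length,
      total: rs.length,
      avg,
      costPerRun: rs.reduce((sum, r) => sum + r.costUsd, 0) / rs.length,
      ok: avg >= bar,
    };
  });
}
```

Each Summary corresponds to one row of the verdict table: pass count, average grade, cost per run, and whether the config clears the bar.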

Use the SDK in your own project

Everything above runs the example inside this repo. To use the SDK in a new, standalone project, you don't need this repo or any workspaces. Install the published package from npm:

mkdir my-app && cd my-app
npm init -y
npm pkg set type=module          # the SDK is ESM
npm install @jetty/sdk

// index.js
import { JettyClient } from "@jetty/sdk";
const jetty = new JettyClient();               // reads JETTY_API_TOKEN (or ~/.config/jetty/token)
console.log((await jetty.listCollections()).map((c) => c.name));

There is no -w @jetty/sdk here; that flag only applies inside the monorepo. From here, copy the orchestration pattern from src/workflows/eval.ts into your own code. The pattern isn't Flue-specific: anywhere you can produce an agent output and call jetty.runAndWait(...) on it (LangChain, a raw provider SDK, a hand-rolled loop), Jetty drops in as the eval layer.
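
The shape of that pattern, stripped of any framework, is a nested loop with two injected functions. The sketch below is illustrative only: evalLoop, Produce, and Grade are hypothetical names, and grade is a placeholder for your own wrapper around the Jetty call, whose exact signature is not shown here.

```typescript
// Framework-agnostic sketch of the eval loop. Both callbacks are supplied by you:
// `produce` runs your agent (Flue, LangChain, a raw SDK call, ...);
// `grade` is a hypothetical wrapper around the Jetty grading call.
type Produce<C, I> = (config: C, input: I) => Promise<string>;
type Grade = (output: string) => Promise<{ total: number; trajectoryId: string }>;

interface EvalResult<C, I> {
  config: C;
  input: I;
  total: number;
  trajectoryId: string;
}

async function evalLoop<C, I>(
  configs: C[],
  inputs: I[],
  produce: Produce<C, I>,
  grade: Grade,
): Promise<EvalResult<C, I>[]> {
  const rows: EvalResult<C, I>[] = [];
  for (const config of configs) {
    for (const input of inputs) {
      const output = await produce(config, input);         // the agent under test
      const { total, trajectoryId } = await grade(output); // the independent grader
      rows.push({ config, input, total, trajectoryId });
    }
  }
  return rows;
}
```

Runs are sequential here for readability; whether grading calls can safely run in parallel depends on your rate limits and is left open.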

Make it yours

Add cases: append to TICKETS in src/tickets.ts.
Compare your own versions: edit the two entries in CONFIGS, or change FLUE_MODEL to A/B-test models.
Move the bar: change PASS_BAR in src/eval.ts.
Swap the grader: the rubric in grader/RUNBOOK.md is plain Python. Replace it with an LLM-judge call for model-based grading, then npm run deploy-grader.

Protect sensitive content

Trajectories persist step inputs and outputs, so they're content-bearing. Put credentials in secretParams (kept out of the stored trajectory), not initParams. If a draft can carry PII, redact it before grading, or grade a hash or summary instead. Treat trajectory storage like any other logging surface.
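
As an illustration of redacting before grading, the sketch below masks email addresses and card-like digit runs in a draft. It is deliberately simplistic: redact is a hypothetical helper, the two patterns are examples only, and real PII detection needs far more than a pair of regular expressions.

```typescript
// Illustrative redaction pass to run on a draft before it is uploaded for grading.
// The patterns are examples, not a complete PII detector.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
// 13 to 16 digits, optionally separated by single spaces or hyphens.
const CARD_PATTERN = /\b\d(?:[ -]?\d){12,15}\b/g;

function redact(text: string): string {
  return text.replace(EMAIL_PATTERN, "[email]").replace(CARD_PATTERN, "[card]");
}
```

Redacting at this point means the stored trajectory never contains the original values, at the cost of the grader not seeing them either.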

Troubleshooting

No workspaces found: --workspace=@jetty/sdk — you're not inside the jetty-sdk checkout. The -w flag is monorepo-only: clone the repo and run from its root (Step 1), or use the SDK in your own project (above).
grader produced no /app/results files — the grader must (a) run on a model the claude-code runtime supports (use claude-sonnet-4-6, not haiku), (b) keep the secrets: ANTHROPIC_API_KEY block in its frontmatter so the key reaches the sandbox, and (c) write to /app/results/. All three are already set in grader/RUNBOOK.md.
No Jetty API token found — you didn't load .env; run set -a && . ./.env && set +a, or export JETTY_API_TOKEN.
Cannot use import statement outside a module — flue run needs "type": "module" (already set here) and flue.config.ts at the example root.
The live run is slow — each grade spins up a sandbox (a few minutes for 2 tickets). That's expected; the offline demo (npm run demo) is the fast path.

Note on scope: Jetty has no external trajectory-ingestion endpoint yet, so grading runs through a Jetty task (which is what creates the trajectory) rather than pushing an externally-produced trace.