eve + Jetty

You shipped an eve agent, built on Vercel's filesystem-first framework where an agent is a directory. You tweaked a prompt. Is it better or worse? You can't tell from one reply. That's the run, check, fix, rerun loop from How to build an AI agent, and Jetty is the check.

eve owns the agent loop. Jetty grades each output and keeps it: every run becomes a trajectory you can score, label, and diff across versions, so a regression shows up before a customer finds it. The grader is a Jetty runbook the agent under test never sees, so a model can't grade its own output and rubber-stamp a regression.

The example is examples/eve-jetty in the SDK repo. It A/B-tests a help-desk triage agent in two configs (warm and terse), grading every reply and flagging the config that regressed:

TICKETS: 2   GRADER: rubric (independent)
 config        pass   avg   $/run
 v1 (warm)    2/2    4.7   0.0093  ✅
 v2 (terse)   0/2    2.7   0.0032  ❌  regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

eve already has evals. Where Jetty fits

eve ships a real eval system (defineEval, eve eval, an LLM-as-judge, a CI deploy gate). Those are your unit and integration tests: you author the assertions, they run your scripted sessions, they gate a deploy. Jetty sits beside them: an independent grader the builder didn't write, and a durable store where every run is a labelled trajectory you can diff across versions, models, and many agents, long after the CI job ended. Keep eve eval for “did this commit break a rule”; add Jetty for “which version is better, and is it improving over time.”

Install

You can't add Jetty with eve add tooling jetty; no such package exists. Jetty plugs in through its SDK:

npm install @jetty/sdk

Requires @jetty/sdk 0.2.0+ (for gradeWithJetty) and eve 0.16+.

The grading flow

eve's agent is a directory you run with npx eve dev (or deploy to Vercel). Drive it over the typed eve/client, then hand the reply to a Jetty grading task and wait for the trajectory. The grade comes back as a row you can label and compare:

import { Client } from "eve/client";
import { JettyClient, gradeWithJetty } from "@jetty/sdk";

const eve = new Client({ host: process.env.EVE_URL ?? "http://127.0.0.1:2000" });
const jetty = new JettyClient(); // JETTY_API_TOKEN from env or ~/.config/jetty/token

// 1. eve runs the agent (it owns the loop).
const prompt = "I was charged twice this month. Can you refund one?"; // example ticket text
const turn = await (await eve.session().send(prompt)).result();

// 2. Jetty grades it server-side, with a grader that isn't the author:
//    upload, run the grader, read the grade, and label, in one call.
const { grade, trajectoryId } = await gradeWithJetty(jetty, "acme", "triage-grader", {
  files: [{ filename: "case.json", data: turn.message ?? "" }],
  useTrialKeys: true,                          // grade on Jetty's free trial, no provider key
  labels: (g) => ({ "eval.grade": String(g.total) }), // labels can read the grade
});

Each grade is a Jetty trajectory: the inputs, outputs, grade, and cost, ready to replay. Compare the eval.* labels across configs to see which version slipped.

Native reporter (next). The tighter integration is a Jetty() eval reporter that drops into eve's evals.config.ts exactly where the built-in Braintrust(...) reporter goes, so every eve eval result lands in Jetty automatically. It ships once Jetty's trajectory-ingestion endpoint lands; the SDK harness above works today.

Configure

Variable	Required	Purpose
`JETTY_API_TOKEN`	yes	Jetty API token (also read from `~/.config/jetty/token`).
`JETTY_COLLECTION`	yes	Collection that owns the grading task.
`JETTY_GRADE_TASK`	yes	The grading runbook (e.g. `triage-grader`).
`JETTY_USE_TRIAL_KEYS`	no	Grade on Jetty's free trial, no provider key (see below).
`EVE_URL`	for the agent	Where the eve agent is reachable (`npx eve dev` serves `127.0.0.1:2000`).
`AI_GATEWAY_API_KEY` / `VERCEL_OIDC_TOKEN`	for the agent	eve resolves models through Vercel AI Gateway, so the agent needs one of these.
`OPENROUTER_API_KEY`	alt. to gateway	No AI Gateway? `agent/agent.ts` routes the agent through OpenRouter directly, an AI SDK “external” provider that bypasses the gateway (eve's `model` field accepts any AI SDK language model).

Credentials. Put anything sensitive in secretParams, which the server keeps out of the stored trajectory. Don't put secrets in initParams; that field is persisted. The SDK never logs your token. Tokens resolve from a constructor arg, then JETTY_API_TOKEN, then ~/.config/jetty/token.
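The token lookup order above can be sketched as a small pure function. This is an illustration only, not the SDK's source: resolveToken, its parameters, and the injected readTokenFile callback are hypothetical names. Only the order (constructor argument, then JETTY_API_TOKEN, then ~/.config/jetty/token) comes from the text.

```typescript
// Hypothetical sketch of the lookup order; not the SDK's actual implementation.
type Env = Record<string, string | undefined>;

function resolveToken(
  explicit: string | undefined,
  env: Env,
  readTokenFile: () => string | undefined, // assumed to read ~/.config/jetty/token
): string {
  // 1 and 2: an explicit constructor argument wins, then the environment variable.
  const fromArgOrEnv = explicit?.trim() || env["JETTY_API_TOKEN"]?.trim();
  if (fromArgOrEnv) return fromArgOrEnv;

  // 3: only touch the token file when nothing else is set.
  const fromFile = readTokenFile()?.trim();
  if (fromFile) return fromFile;

  throw new Error("No Jetty token found (constructor arg, JETTY_API_TOKEN, or token file)");
}
```

Injecting the environment and the file reader keeps the function easy to test without touching real credentials.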

What Jetty captures

eve	Jetty
Agent output (the draft)	The input the grading runbook scores
Grade (1–5)	Label `eval.grade` on the trajectory
Pass / fail vs. the bar	Label `eval.pass`
Per-run cost (estimated from `step.completed` token usage)	Label `eval.cost_est_usd`
Which agent config / version	Label `eval.config`
The whole graded run	A trajectory: inputs, outputs, steps, replayable

eve reports token usage but no dollar cost, so the example estimates $/run from tokens and a small per-model price table (src/cost.ts). Tune it to your real rates.
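As a rough sketch of that estimate, assuming a price table in USD per million tokens: the model name, the prices, and the Usage field names below are placeholders, not values from src/cost.ts or from eve's step.completed payload.

```typescript
// Hypothetical shapes and prices; replace them with your real rates.
interface Usage { inputTokens: number; outputTokens: number; }
interface Price { inputPerMTok: number; outputPerMTok: number; }

const PRICES: Record<string, Price> = {
  "example-model": { inputPerMTok: 3, outputPerMTok: 15 }, // placeholder numbers
};

function estimateCostUsd(model: string, usage: Usage): number | undefined {
  const price = PRICES[model];
  if (!price) return undefined; // unknown model: report no estimate rather than guess
  const micros =
    usage.inputTokens * price.inputPerMTok + usage.outputTokens * price.outputPerMTok;
  return micros / 1e6; // prices are per million tokens
}
```

Returning undefined for an unknown model keeps a missing price from showing up as a misleading $0.00 in the verdict table.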

Full walkthrough: catch a regression

A step-by-step run of the example: first offline (no keys, ~10 seconds), then live. There are three levels of “do I need a key?”:

Offline demo (npm run demo): no keys, no eve, no network.
Grading on Jetty: covered by the free trial (10 runs, auto-activated). No API key.
Live eve agent (npx eve dev): resolves models through AI Gateway, so it needs your AI Gateway credential.

1. Clone and build

Run the example bundled in the repo. The -w @jetty/sdk flag targets a workspace in this monorepo, so clone the repo and run everything from inside the checkout. (To use the SDK in your own app instead, jump to Use the SDK in your own project.)

git clone https://github.com/jettyio/jetty-sdk.git
cd jetty-sdk
npm install                    # installs the SDK, the example, and eve (one workspace install)
npm run build -w @jetty/sdk    # the example imports the built SDK
cd examples/eve-jetty

2. Run the offline demo (no keys)

npm run demo

You should see the verdict table immediately:

Acme Helpdesk — did my last change to the triage agent make it worse?
(simulated; run `npm run ab-eval` against a live `npx eve dev` for the real thing)

TICKETS: 5   GRADER: rubric (independent)

 config        pass   avg   $/run
 ------------  -----  ----  -------
 v1 (warm)    5/5    4.5   0.0051  ✅
 v2 (terse)   1/5    3.5   0.0039  ❌  regressed

→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

It also writes report.html and opens it: a styled verdict and per-run breakdown, the same report the live run produces. This is a deterministic stand-in, no spend. If you only want to understand the example, you can stop here.

3. Configure credentials (for the live run)

cp .env.example .env

Edit .env:

AI_GATEWAY_API_KEY=...                 # or VERCEL_OIDC_TOKEN, for the eve agent
JETTY_API_TOKEN=mlc_...                # your Jetty token
JETTY_COLLECTION=your-collection       # a collection your token can write to
JETTY_GRADE_TASK=triage-grader         # leave as-is
EVE_URL=http://127.0.0.1:2000          # where `npx eve dev` serves the agent

Load it into your shell (the scripts read process.env; they don't auto-load .env):

set -a && . ./.env && set +a
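Because the scripts read process.env directly, a variable you forgot to export tends to surface late, mid-run. A small pre-flight check can catch it earlier. This is a sketch and not part of the example: missingEnv is a hypothetical helper, and the required names are the ones marked "yes" in the Configure table.

```typescript
// Hypothetical pre-flight check; the names come from the Configure table.
const REQUIRED = ["JETTY_API_TOKEN", "JETTY_COLLECTION", "JETTY_GRADE_TASK"] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  // A variable counts as missing when it is unset or only whitespace.
  return REQUIRED.filter((name) => !env[name]?.trim());
}
```

In a script you would pass process.env and exit with a message if the returned list is non-empty.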

4. Deploy the grader (one time)

The harness calls a Jetty runbook that scores each draft. Deploy it into your collection:

npm run deploy-grader

This creates the triage-grader task from grader/RUNBOOK.md (and pushes a provider key into the collection if you have one). Re-running it updates the task.

5. Serve the agent and run the live A/B

# terminal 1: serve the eve agent (needs Node 24+)
npx eve dev

# terminal 2: A/B-eval it (start small: each ticket is a real server-side grade)
EVAL_TICKETS=2 npm run ab-eval

You'll see a line per run, then the verdict table:

  v1 (warm) · reset: 4.7 PASS
  v1 (warm) · double-charge: 4.7 PASS
  v2 (terse) · reset: 2.7 fail
  v2 (terse) · double-charge: 2.7 fail

TICKETS: 2   GRADER: rubric (independent)
 config        pass   avg   $/run
 v1 (warm)    2/2    4.7   0.0093  ✅
 v2 (terse)   0/2    2.7   0.0032  ❌  regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).

It then writes report.html and opens it: the verdict, a per-run breakdown, and links to each Jetty trajectory. The eval caught that the terse config regressed, before it ever reached a customer.

Run it on Jetty's free trial (no key)

Jetty gives every collection a free trial (10 runs, auto-activated, on Jetty's keys), so you can run the grading with no provider key and no key-push:

JETTY_USE_TRIAL_KEYS=true EVAL_TICKETS=2 npm run ab-eval

The trial covers server-side Jetty runs (the grader). The eve agent runs on your machine via npx eve dev, so the full live run still needs an AI Gateway credential for the agent. But you can see Jetty's grading and trajectories on the trial, and npm run demo needs no keys at all. (Sonnet and most models are covered; Opus-class is excluded. After 10 runs, add your own key in Settings.)

6. Inspect what got stored

Every grade is a Jetty trajectory, labelled with eval.config, eval.grade, eval.pass, and eval.cost_est_usd. View them in the Jetty UI (https://flows.jetty.io/<collection>/triage-grader) or from code:

import { JettyClient } from "@jetty/sdk";
const jetty = new JettyClient(); // reads JETTY_API_TOKEN
const list = await jetty.listTrajectories(process.env.JETTY_COLLECTION!, "triage-grader", 5);
for (const t of list.trajectories) {
  const full = await jetty.getTrajectory(process.env.JETTY_COLLECTION!, "triage-grader", t.trajectory_id);
  const labels = Object.fromEntries(full.labels.map((l) => [l.key, l.value]));
  console.log(t.trajectory_id, labels["eval.config"], labels["eval.grade"], labels["eval.pass"]);
}

Because the runs are durable and labelled, you can compare configs across releases, long after the terminal session that produced them.
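One way to do that comparison is to group the label maps by eval.config and recompute the pass count and average grade. The sketch below assumes each row is a plain key/value label map like the one built in the loop above. summarizeByConfig is a hypothetical helper, and treating "grade >= 4.0" as a pass is an assumption about how the bar is applied.

```typescript
// Hypothetical helper: each row is one trajectory's labels as a key/value map.
type Labels = Record<string, string | undefined>;
interface Summary { runs: number; passed: number; avgGrade: number; }

const PASS_BAR = 4.0; // assumption: a run passes when its grade is >= the bar

function summarizeByConfig(rows: Labels[]): Map<string, Summary> {
  const totals = new Map<string, { runs: number; passed: number; sum: number }>();
  for (const row of rows) {
    const config = row["eval.config"];
    const raw = row["eval.grade"];
    const grade = raw ? Number(raw) : NaN; // label values are strings
    if (!config || !Number.isFinite(grade)) continue; // skip unlabelled or malformed runs
    const t = totals.get(config) ?? { runs: 0, passed: 0, sum: 0 };
    t.runs += 1;
    t.sum += grade;
    if (grade >= PASS_BAR) t.passed += 1;
    totals.set(config, t);
  }
  const out = new Map<string, Summary>();
  totals.forEach((t, config) => {
    out.set(config, { runs: t.runs, passed: t.passed, avgGrade: t.sum / t.runs });
  });
  return out;
}
```

Deriving pass/fail from the grade rather than from eval.pass avoids guessing at that label's string format, and it lets you re-score old runs against a different bar.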

How it works (the pieces)

File	Role
`agent/instructions.md` + `agent/agent.ts`	The eve agent as a directory: the always-on system prompt and the runtime config.
`src/tickets.ts`	The eval cases + the two configs (`v1` warm, `v2` terse).
`src/ab-eval.ts`	The live loop: for each config × ticket → drive eve via `eve/client` → grade + label → collect.
`src/cost.ts`	Estimates per-run cost from eve token usage (eve has no dollar-cost field).
`src/eval.ts`	`aggregate()` (per-config pass-rate/grade/cost) + `renderVerdict()` (the table).
`grader/RUNBOOK.md`	The independent grader: a deterministic Python rubric.

The SDK does the orchestration: runWithFiles/runAndWait (with file upload), getTrajectory, downloadFile, addLabel, createTask. That's the part worth copying into your own eval.

Use the SDK in your own project

Everything above runs the example inside this repo. To use the SDK in a new, standalone project, you don't need this repo or any workspaces. Install the published package from npm:

mkdir my-app && cd my-app
npm init -y
npm pkg set type=module          # the SDK is ESM
npm install @jetty/sdk eve

The pattern isn't eve-specific: anywhere you can produce an agent output and call jetty.runAndWait(...) on it (eve, Flue, LangChain, a raw provider SDK, a hand-rolled loop), Jetty drops in as the eval layer. Copy the orchestration from src/ab-eval.ts into your own code.
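To make the framework-agnostic shape concrete, here is a minimal sketch of the config × case loop with the agent call and the grading call passed in as functions. Every name in it (abEval, RunResult, runAgent, grade) is hypothetical. In a real harness runAgent would wrap your framework and grade would wrap a Jetty grading call, and src/ab-eval.ts may be organized differently.

```typescript
// Hypothetical harness: the agent and the grader are injected, so nothing here
// depends on eve or on the Jetty SDK.
interface RunResult { config: string; caseId: string; grade: number; }

async function abEval(
  configs: string[],
  cases: { id: string; prompt: string }[],
  runAgent: (config: string, prompt: string) => Promise<string>,
  grade: (output: string) => Promise<number>,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const config of configs) {
    for (const c of cases) {
      const output = await runAgent(config, c.prompt); // the framework owns the agent loop
      const score = await grade(output); // the grader only sees the output
      results.push({ config, caseId: c.id, grade: score });
    }
  }
  return results;
}
```

The loop runs sequentially for simplicity; that is a choice made for this sketch, not a requirement.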

Make it yours

Add cases: append to TICKETS in src/tickets.ts.
Compare your own versions: edit the two entries in CONFIGS, or change EVE_MODEL in agent/agent.ts to A/B models.
Move the bar: change PASS_BAR in src/eval.ts.
Swap the grader: the rubric in grader/RUNBOOK.md is plain Python. Replace it with an LLM-judge call for model-based grading, then npm run deploy-grader.

Protect sensitive content

Trajectories persist step inputs and outputs, so they're content-bearing. Put credentials in secretParams (kept out of the stored trajectory), not initParams. If a draft can carry PII, redact it before grading, or grade a hash or summary instead. Treat trajectory storage like any other logging surface.
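A minimal redaction pass before grading might look like the sketch below. The two patterns (email addresses and card-like digit runs) are illustrative assumptions, not an exhaustive PII filter, and redact is a hypothetical helper rather than something the SDK provides.

```typescript
// Hypothetical redactor; extend the patterns for the PII your drafts can carry.
const EMAIL = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const LONG_DIGITS = /\b(?:\d[ -]?){12,18}\d\b/g; // 13 to 19 digits, optional spaces or dashes

function redact(text: string): string {
  return text.replace(EMAIL, "[email]").replace(LONG_DIGITS, "[number]");
}
```

You would call redact on the draft before passing it to the grader as file data, so the stored trajectory only ever holds the redacted text.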

Troubleshooting

The harness can't reach the agent. Start it first with npx eve dev (Node 24+) and point EVE_URL at it. eve's HTTP channel fails closed for non-loopback traffic; for a deployed agent, add an authenticator in agent/channels/eve.ts.
The agent didn't return JSON. The triage prompt asks for a bare JSON object; extractTriage tolerates fences and prose, but a chatty model can still wander. Tighten agent/instructions.md.
grader produced no /app/results files. The grader must run on a supported model, keep its secrets frontmatter, and write to /app/results/. All three are set in grader/RUNBOOK.md.
The live run is slow. Each grade spins up a sandbox (a few minutes for 2 tickets); that's expected, and the offline demo (npm run demo) is the fast path.
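For the non-JSON case above, tolerant parsing generally means looking inside a code fence first and then falling back to the outermost braces. The sketch below shows that idea under a hypothetical name, extractJsonObject. The example's real extractTriage may work differently.

```typescript
// Hypothetical stand-in for a tolerant parser; not the example's extractTriage.
function extractJsonObject(reply: string): unknown {
  // Prefer the contents of a Markdown code fence when one is present
  // (\x60 is the backtick character).
  const fenced = reply.match(/\x60{3}(?:json)?\s*([\s\S]*?)\x60{3}/);
  const candidate = fenced ? fenced[1] : reply;

  // Then take the outermost {...} span, ignoring any prose around it.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) return undefined;

  try {
    return JSON.parse(candidate.slice(start, end + 1));
  } catch {
    return undefined; // the caller decides whether to retry or fail the case
  }
}
```

A parser like this only hides formatting noise. If it returns undefined often, tightening the prompt is still the real fix.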

Note on scope: Jetty has no external trajectory-ingestion endpoint yet, so grading runs through a Jetty task (which is what creates the trajectory) rather than pushing an externally-produced trace. That endpoint is also the unlock for the native Jetty() eve eval reporter.