eve + Jetty
You shipped an eve agent, built on Vercel's filesystem-first framework where an agent is a directory. You tweaked a prompt. Is it better or worse? You can't tell from one reply. That's the run, check, fix, rerun loop from How to build an AI agent, and Jetty is the check.
eve owns the agent loop. Jetty grades each output and keeps it: every run becomes a trajectory you can score, label, and diff across versions, so a regression shows up before a customer finds it. The grader is a Jetty runbook the agent under test never sees, so a model can't grade its own output and rubber-stamp a regression.
The example is examples/eve-jetty in the SDK repo. It A/B-tests a help-desk triage agent in two configs (warm and terse), grading every reply and flagging the config that regressed:
TICKETS: 2 GRADER: rubric (independent)
config pass avg $/run
v1 (warm) 2/2 4.7 0.0093 ✅
v2 (terse) 0/2 2.7 0.0032 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
eve already has evals. Where Jetty fits
eve ships a real eval system (defineEval, eve eval, an LLM-as-judge, a CI deploy gate). Those are your unit and integration tests: you author the assertions, they run your scripted sessions, they gate a deploy. Jetty sits beside them: an independent grader the builder didn't write, and a durable store where every run is a labelled trajectory you can diff across versions, models, and many agents, long after the CI job ended. Keep eve eval for “did this commit break a rule”; add Jetty for “which version is better, and is it improving over time.”
Install
There's no eve add tooling jetty package. Jetty plugs in through its SDK:
npm install @jetty/sdk
Requires @jetty/sdk 0.2.0+ (for gradeWithJetty) and eve 0.16+.
The grading flow
eve's agent is a directory you run with npx eve dev (or deploy to Vercel). Drive it over the typed eve/client, then hand the reply to a Jetty grading task and wait for the trajectory. The grade comes back as a row you can label and compare:
import { Client } from "eve/client";
import { JettyClient, gradeWithJetty } from "@jetty/sdk";
const eve = new Client({ host: process.env.EVE_URL ?? "http://127.0.0.1:2000" });
const jetty = new JettyClient(); // JETTY_API_TOKEN from env or ~/.config/jetty/token
// 1. eve runs the agent (it owns the loop).
const turn = await (await eve.session().send(prompt)).result();
// 2. Jetty grades it server-side, with a grader that isn't the author —
// upload, run the grader, read the grade, and label, in one call.
const { grade, trajectoryId } = await gradeWithJetty(jetty, "acme", "triage-grader", {
  files: [{ filename: "case.json", data: turn.message ?? "" }],
  useTrialKeys: true, // grade on Jetty's free trial, no provider key
  labels: (g) => ({ "eval.grade": String(g.total) }), // labels can read the grade
});
Each grade is a Jetty trajectory: the inputs, outputs, grade, and cost, ready to replay. Compare the eval.* labels across configs to see which version slipped.
Native reporter (next). The tighter integration is a Jetty() eval reporter that drops into eve's evals.config.ts exactly where the built-in Braintrust(...) reporter goes, so every eve eval result lands in Jetty automatically. It ships once Jetty's trajectory-ingestion endpoint lands; the SDK harness above works today.
Configure
| Variable | Required | Purpose |
|---|---|---|
| JETTY_API_TOKEN | yes | Jetty API token (also read from ~/.config/jetty/token). |
| JETTY_COLLECTION | yes | Collection that owns the grading task. |
| JETTY_GRADE_TASK | yes | The grading runbook (e.g. triage-grader). |
| JETTY_USE_TRIAL_KEYS | no | Grade on Jetty's free trial, no provider key (see below). |
| EVE_URL | for the agent | Where the eve agent is reachable (npx eve dev serves 127.0.0.1:2000). |
| AI_GATEWAY_API_KEY / VERCEL_OIDC_TOKEN | for the agent | eve resolves models through Vercel AI Gateway, so the agent needs one of these. |
| OPENROUTER_API_KEY | alt. to gateway | No AI Gateway? agent/agent.ts routes the agent through OpenRouter directly, an AI SDK “external” provider that bypasses the gateway (eve's model field accepts any AI SDK language model). |
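The variables in the table can be checked up front so a missing one fails fast instead of midway through a run. The sketch below is one hedged way to do that. `loadEvalConfig` and the `EvalConfig` shape are hypothetical names, not part of `@jetty/sdk` or the example. It also assumes `JETTY_USE_TRIAL_KEYS` is enabled by the literal string `true`, and it takes the environment as a plain object rather than reading `process.env` itself.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical config shape for an eval harness; not an SDK type.
interface EvalConfig {
  token: string | undefined; // may be undefined: the token can also come from ~/.config/jetty/token
  collection: string;
  gradeTask: string;
  useTrialKeys: boolean;
  eveUrl: string;
}

function loadEvalConfig(env: Env): EvalConfig {
  // The token is not checked here because the table says it can also be read from a file.
  const missing = ["JETTY_COLLECTION", "JETTY_GRADE_TASK"].filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    token: env.JETTY_API_TOKEN,
    collection: env.JETTY_COLLECTION as string,
    gradeTask: env.JETTY_GRADE_TASK as string,
    // Assumption: only the exact string "true" turns the trial on.
    useTrialKeys: env.JETTY_USE_TRIAL_KEYS === "true",
    // Default taken from the address that npx eve dev serves.
    eveUrl: env.EVE_URL ?? "http://127.0.0.1:2000",
  };
}
```

In a script you would call `loadEvalConfig(process.env)` once at startup and pass the result down, which keeps the rest of the harness free of direct environment reads.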
Credentials. Put anything sensitive in secretParams, which the server keeps out of the stored trajectory. Don't put secrets in initParams; that field is persisted. The SDK never logs your token. Tokens resolve from a constructor arg, then JETTY_API_TOKEN, then ~/.config/jetty/token.
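The resolution order described above (constructor arg, then JETTY_API_TOKEN, then the token file) can be pictured as a simple fallback chain. This is an illustrative sketch only, not the SDK's implementation: `resolveToken` and `TokenSources` are hypothetical names, and the file read is injected as a callback so the example stays free of filesystem access.

```typescript
interface TokenSources {
  explicit?: string;                        // value passed to the constructor, if any
  env: Record<string, string | undefined>;  // e.g. process.env
  readTokenFile: () => string | undefined;  // returns the contents of ~/.config/jetty/token, or undefined
}

function resolveToken(sources: TokenSources): string {
  // 1 and 2: the explicit argument wins, then the environment variable.
  for (const candidate of [sources.explicit, sources.env.JETTY_API_TOKEN]) {
    if (candidate !== undefined && candidate.trim() !== "") {
      return candidate.trim();
    }
  }
  // 3: the token file is consulted only when the first two sources are empty.
  const fromFile = sources.readTokenFile();
  if (fromFile !== undefined && fromFile.trim() !== "") {
    return fromFile.trim();
  }
  throw new Error("No Jetty token found in constructor arg, JETTY_API_TOKEN, or token file");
}
```

Reading the file lazily means a missing or unreadable token file never matters when the token is already supplied another way.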
What Jetty captures
| eve | Jetty |
|---|---|
| Agent output (the draft) | The input the grading runbook scores |
| Grade (1–5) | Label eval.grade on the trajectory |
| Pass / fail vs. the bar | Label eval.pass |
| Per-run cost (estimated from step.completed token usage) | Label eval.cost_est_usd |
| Which agent config / version | Label eval.config |
| The whole graded run | A trajectory: inputs, outputs, steps, replayable |
eve reports token usage but no dollar cost, so the example estimates $/run from tokens and a small per-model price table (src/cost.ts). Tune it to your real rates.
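To make the token-to-dollars estimate concrete, here is a minimal sketch of the arithmetic. It is not the contents of src/cost.ts: the `TokenUsage` field names, the `PriceTable` shape, and `estimateCostUsd` are assumptions for illustration, and prices are expressed per million tokens with no real rates filled in.

```typescript
// Assumed usage shape; eve's actual step.completed payload may name these differently.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

// USD per one million tokens, split by direction.
interface ModelPrice {
  inputPerMTok: number;
  outputPerMTok: number;
}

type PriceTable = Record<string, ModelPrice>;

function estimateCostUsd(usage: TokenUsage, model: string, prices: PriceTable): number {
  const price = prices[model];
  if (price === undefined) {
    // Failing loudly avoids silently reporting a cost of zero for an unpriced model.
    throw new Error(`No price configured for model: ${model}`);
  }
  const inputCost = usage.inputTokens * price.inputPerMTok;
  const outputCost = usage.outputTokens * price.outputPerMTok;
  return (inputCost + outputCost) / 1e6;
}
```

Because the estimate is only as good as the table, keep the prices next to the code that uses them and update them when your provider's rates change.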
Full walkthrough: catch a regression
A step-by-step run of the example: first offline (no keys, ~10 seconds), then live. There are three levels of “do I need a key?”:
- Offline demo (npm run demo): no keys, no eve, no network.
- Grading on Jetty: covered by the free trial (10 runs, auto-activated). No provider API key.
- The live eve agent runs via npx eve dev and resolves models through AI Gateway, so it uses your AI Gateway credential.
1. Clone and build
Run the example bundled in the repo. The -w @jetty/sdk flags refer to this monorepo's workspaces, so clone it and run everything from inside the checkout. (To use the SDK in your own app instead, jump to Use the SDK in your own project.)
git clone https://github.com/jettyio/jetty-sdk.git
cd jetty-sdk
npm install # installs the SDK, the example, and eve (one workspace install)
npm run build -w @jetty/sdk # the example imports the built SDK
cd examples/eve-jetty
2. Run the offline demo (no keys)
npm run demo
You should see the verdict table immediately:
Acme Helpdesk — did my last change to the triage agent make it worse?
(simulated; run `npm run ab-eval` against a live `npx eve dev` for the real thing)
TICKETS: 5 GRADER: rubric (independent)
config pass avg $/run
------------ ----- ---- -------
v1 (warm) 5/5 4.5 0.0051 ✅
v2 (terse) 1/5 3.5 0.0039 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
It also writes report.html and opens it: a styled verdict and per-run breakdown, the same report the live run produces. This is a deterministic stand-in, no spend. If you only want to understand the example, you can stop here.
3. Configure credentials (for the live run)
cp .env.example .env
Edit .env:
AI_GATEWAY_API_KEY=... # or VERCEL_OIDC_TOKEN, for the eve agent
JETTY_API_TOKEN=mlc_... # your Jetty token
JETTY_COLLECTION=your-collection # a collection your token can write to
JETTY_GRADE_TASK=triage-grader # leave as-is
EVE_URL=http://127.0.0.1:2000 # where `npx eve dev` serves the agent
Load it into your shell (the scripts read process.env; they don't auto-load .env):
set -a && . ./.env && set +a
4. Deploy the grader (one time)
The harness calls a Jetty runbook that scores each draft. Deploy it into your collection:
npm run deploy-grader
This creates the triage-grader task from grader/RUNBOOK.md (and pushes a provider key into the collection if you have one). Re-running it updates the task.
5. Serve the agent and run the live A/B
# terminal 1: serve the eve agent (needs Node 24+)
npx eve dev
# terminal 2: A/B-eval it (start small: each ticket is a real server-side grade)
EVAL_TICKETS=2 npm run ab-eval
You'll see a line per run, then the verdict table:
v1 (warm) · reset: 4.7 PASS
v1 (warm) · double-charge: 4.7 PASS
v2 (terse) · reset: 2.7 fail
v2 (terse) · double-charge: 2.7 fail
TICKETS: 2 GRADER: rubric (independent)
config pass avg $/run
v1 (warm) 2/2 4.7 0.0093 ✅
v2 (terse) 0/2 2.7 0.0032 ❌ regressed
→ v2 (terse) is cheaper but fails the bar (4.0). Keep v1 (warm).
…then it writes report.html and opens it: the verdict, a per-run breakdown, and links to each Jetty trajectory. The eval caught that the terse config regressed, before it ever reached a customer.
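The verdict table boils down to a per-config roll-up of the graded runs. The sketch below shows one plausible way to compute it; it is not the example's src/eval.ts. The `GradedRun` and `ConfigSummary` shapes are hypothetical, and two rules are assumptions: a run passes when its grade is at or above the bar, and a config counts as regressed when its average grade falls below the bar.

```typescript
interface GradedRun {
  config: string;  // e.g. "v1 (warm)"
  grade: number;   // rubric score on the 1 to 5 scale
  costUsd: number; // estimated cost of the run
}

interface ConfigSummary {
  config: string;
  passed: number;
  total: number;
  avgGrade: number;
  avgCostUsd: number;
  regressed: boolean;
}

const PASS_BAR = 4.0;

function aggregate(runs: GradedRun[], bar: number = PASS_BAR): ConfigSummary[] {
  // Group runs by config name, preserving first-seen order.
  const byConfig: Record<string, GradedRun[]> = {};
  for (const run of runs) {
    const list = byConfig[run.config] ?? [];
    list.push(run);
    byConfig[run.config] = list;
  }
  return Object.keys(byConfig).map((config) => {
    const group = byConfig[config] ?? [];
    const total = group.length;
    const avgGrade = group.reduce((sum, r) => sum + r.grade, 0) / total;
    const avgCostUsd = group.reduce((sum, r) => sum + r.costUsd, 0) / total;
    return {
      config,
      passed: group.filter((r) => r.grade >= bar).length, // assumption: pass means grade >= bar
      total,
      avgGrade,
      avgCostUsd,
      regressed: avgGrade < bar, // assumption: regression means the average misses the bar
    };
  });
}
```

Fed the four runs from the live output above, this yields 2/2 passing for v1 (warm) and 0/2 for v2 (terse), with only v2 marked as regressed.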
Run it on Jetty's free trial (no key)
Jetty gives every collection a free trial (10 runs, auto-activated, on Jetty's keys), so you can run the grading with no provider key and no key-push:
JETTY_USE_TRIAL_KEYS=true EVAL_TICKETS=2 npm run ab-eval
The trial covers server-side Jetty runs (the grader). The eve agent runs on your machine via npx eve dev, so the full live run still needs an AI Gateway credential for the agent. But you can see Jetty's grading and trajectories on the trial, and npm run demo needs no keys at all. (Sonnet and most models are covered; Opus-class is excluded. After 10 runs, add your own key in Settings.)
6. Inspect what got stored
Every grade is a Jetty trajectory, labelled with eval.config, eval.grade, eval.pass, and eval.cost_est_usd. View them in the Jetty UI (https://flows.jetty.io/<collection>/triage-grader) or from code:
import { JettyClient } from "@jetty/sdk";
const jetty = new JettyClient(); // reads JETTY_API_TOKEN
const list = await jetty.listTrajectories(process.env.JETTY_COLLECTION!, "triage-grader", 5);
for (const t of list.trajectories) {
  const full = await jetty.getTrajectory(process.env.JETTY_COLLECTION!, "triage-grader", t.trajectory_id);
  const labels = Object.fromEntries(full.labels.map((l) => [l.key, l.value]));
  console.log(t.trajectory_id, labels["eval.config"], labels["eval.grade"], labels["eval.pass"]);
}
Because the runs are durable and labelled, you can compare configs across releases, long after the terminal session that produced them.
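Once the labels are in hand as plain key/value maps, comparing configs is ordinary data wrangling. As a hedged illustration, the helper below averages eval.grade per eval.config. `meanGradeByConfig` is a hypothetical name, it works on plain objects rather than SDK types, and it assumes eval.grade holds a numeric string, as in the grading call earlier.

```typescript
type Labels = Record<string, string>;

function meanGradeByConfig(runs: Labels[]): Record<string, number> {
  const sums: Record<string, { total: number; count: number }> = {};
  for (const labels of runs) {
    const config = labels["eval.config"];
    const rawGrade = labels["eval.grade"];
    // Skip trajectories that lack either label or carry a non-numeric grade.
    if (config === undefined || rawGrade === undefined || rawGrade.trim() === "") continue;
    const grade = Number(rawGrade);
    if (!Number.isFinite(grade)) continue;
    const entry = sums[config] ?? { total: 0, count: 0 };
    entry.total += grade;
    entry.count += 1;
    sums[config] = entry;
  }
  const means: Record<string, number> = {};
  for (const config of Object.keys(sums)) {
    const entry = sums[config];
    if (entry !== undefined) means[config] = entry.total / entry.count;
  }
  return means;
}
```

Running the same helper over trajectories from two releases gives two small maps you can compare side by side to see which config moved.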
How it works (the pieces)
| File | Role |
|---|---|
| agent/instructions.md + agent/agent.ts | The eve agent as a directory: the always-on system prompt and the runtime config. |
| src/tickets.ts | The eval cases + the two configs (v1 warm, v2 terse). |
| src/ab-eval.ts | The live loop: for each config × ticket → drive eve via eve/client → grade + label → collect. |
| src/cost.ts | Estimates per-run cost from eve token usage (eve has no dollar-cost field). |
| src/eval.ts | aggregate() (per-config pass-rate/grade/cost) + renderVerdict() (the table). |
| grader/RUNBOOK.md | The independent grader: a deterministic Python rubric. |
The SDK does the orchestration: runWithFiles/runAndWait (with file upload), getTrajectory, downloadFile, addLabel, createTask. That's the part worth copying into your own eval.
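The shape of the config × ticket loop is easy to see with the agent and the grader abstracted away. The sketch below is not src/ab-eval.ts: `abEval`, `AgentConfig`, `Ticket`, and the two injected callbacks are hypothetical, standing in for the eve call and the Jetty grading call so the control flow can be read on its own.

```typescript
interface AgentConfig { name: string; instructions: string; }
interface Ticket { id: string; prompt: string; }
interface RunResult { config: string; ticket: string; grade: number; }

// Stand-ins for "drive eve" and "grade on Jetty"; real versions would call the respective clients.
type RunAgent = (config: AgentConfig, ticket: Ticket) => Promise<string>;
type GradeDraft = (draft: string) => Promise<number>;

async function abEval(
  configs: AgentConfig[],
  tickets: Ticket[],
  runAgent: RunAgent,
  gradeDraft: GradeDraft,
): Promise<RunResult[]> {
  const results: RunResult[] = [];
  for (const config of configs) {
    for (const ticket of tickets) {
      // Runs are sequential here so output order matches the nested loops.
      const draft = await runAgent(config, ticket);
      const grade = await gradeDraft(draft);
      results.push({ config: config.name, ticket: ticket.id, grade });
    }
  }
  return results;
}
```

Keeping the agent and grader behind function parameters is also what lets the same loop run against an offline stand-in, in the spirit of the demo.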
Use the SDK in your own project
Everything above runs the example inside this repo. To use the SDK in a new, standalone project, you don't need this repo or any workspaces. Install the published package from npm:
mkdir my-app && cd my-app
npm init -y
npm pkg set type=module # the SDK is ESM
npm install @jetty/sdk eve
The pattern isn't eve-specific: anywhere you can produce an agent output and call jetty.runAndWait(...) on it (eve, Flue, LangChain, a raw provider SDK, a hand-rolled loop), Jetty drops in as the eval layer. Copy the orchestration from src/ab-eval.ts into your own code.
Make it yours
- Add cases: append to TICKETS in src/tickets.ts.
- Compare your own versions: edit the two entries in CONFIGS, or change EVE_MODEL in agent/agent.ts to A/B models.
- Move the bar: change PASS_BAR in src/eval.ts.
- Swap the grader: the rubric in grader/RUNBOOK.md is plain Python. Replace it with an LLM-judge call for model-based grading, then npm run deploy-grader.
Protect sensitive content
Trajectories persist step inputs and outputs, so they're content-bearing. Put credentials in secretParams (kept out of the stored trajectory), not initParams. If a draft can carry PII, redact it before grading, or grade a hash or summary instead. Treat trajectory storage like any other logging surface.
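As a rough illustration of redacting a draft before it is uploaded for grading, the sketch below masks email addresses and phone-like numbers. The `redact` helper and both patterns are assumptions for demonstration only. They are deliberately simple, will miss some PII, and may mask harmless long numbers, so treat them as a starting point and not a complete filter.

```typescript
// Illustrative patterns only; real PII detection needs broader coverage.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;
const PHONE_PATTERN = /\+?\d[\d\s().-]{7,}\d/g;

function redact(text: string): string {
  return text
    .replace(EMAIL_PATTERN, "[email]") // mask anything shaped like an address
    .replace(PHONE_PATTERN, "[phone]"); // mask runs of 9+ digits and separators
}
```

Apply the redaction to the draft string before it goes into the uploaded file, so the stored trajectory only ever contains the masked text.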
Troubleshooting
- The harness can't reach the agent. Start it first with npx eve dev (Node 24+) and point EVE_URL at it. eve's HTTP channel fails closed for non-loopback traffic; for a deployed agent, add an authenticator in agent/channels/eve.ts.
- The agent didn't return JSON. The triage prompt asks for a bare JSON object; extractTriage tolerates fences and prose, but a chatty model can still wander. Tighten agent/instructions.md.
- grader produced no /app/results files. The grader must run on a supported model, keep its secrets frontmatter, and write to /app/results/. All three are set in grader/RUNBOOK.md.
- The live run is slow. Each grade spins up a sandbox (a few minutes for 2 tickets); that's expected, and the offline demo (npm run demo) is the fast path.
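For the JSON case above, it can help to see what tolerating fences and prose might look like. The sketch below is not the example's extractTriage. `extractJsonObject` is a hypothetical helper that takes the span from the first opening brace to the last closing brace and tries to parse it, returning undefined when that fails.

```typescript
function extractJsonObject(reply: string): Record<string, unknown> | undefined {
  const start = reply.indexOf("{");
  const end = reply.lastIndexOf("}");
  if (start === -1 || end <= start) {
    return undefined; // no brace-delimited span at all
  }
  try {
    // Anything outside the braces (code fences, greetings, sign-offs) is ignored.
    const parsed: unknown = JSON.parse(reply.slice(start, end + 1));
    if (typeof parsed === "object" && parsed !== null) {
      return parsed as Record<string, unknown>;
    }
    return undefined;
  } catch (err) {
    return undefined; // the span was not valid JSON, e.g. prose containing stray braces
  }
}
```

This approach fails when the surrounding prose itself contains braces, which is one reason tightening the prompt is the more reliable fix.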
Note on scope: Jetty has no external trajectory-ingestion endpoint yet, so grading runs through a Jetty task (which is what creates the trajectory) rather than pushing an externally-produced trace. That endpoint is also the unlock for the native Jetty() eve eval reporter.