Reliable agents. One API. No framework.
Jetty runs Claude Code, Codex, and Gemini CLI in isolated sandboxes with durable execution, eval-driven quality gates, and full trajectory capture. OpenAI-compatible. Zero infra to manage.
```bash
# Install the skill
claude plugin marketplace add jettyio/jettyio-skills
claude plugin install jetty@jetty

# Or, just call the API
curl https://flows-api.jetty.io/v1/chat/completions \
  -H "Authorization: Bearer $JETTY_TOKEN" \
  -d '{"model": "claude-sonnet-4-6", "messages": [...], "jetty": {"runbook": true, "task": "analyze-data"}}'
```

The failure mode you've lived with
You've shipped agents. They work on your machine. They skip a step in CI and declare success. They call the right APIs and then route around the one that errored. Your /review-pr script was supposed to run six checks and it ran four and the PR merged anyway.
You've tried the remedies. LangGraph for the graph model. Temporal for durability. A custom wrapper around each agent CLI. Retry logic held together by hope. Each of them solved one problem and introduced two more: framework weight, worker processes, YAML configuration, and a steady drift between what the agent was supposed to do and what it actually did.
You want something between “a prompt” and “standing up a platform.” You want to keep your instructions in version control. You want to swap Claude for Gemini with a config change. You want durability without running Temporal workers yourself.
That's what Jetty is.
How it works
Jetty exposes a single endpoint: /v1/chat/completions. It speaks the OpenAI Chat Completions protocol, so your existing SDKs work unchanged.
Add a `jetty` block to the request. That switches the endpoint from passthrough mode to agent mode:
```json
{
  "model": "claude-sonnet-4-6",
  "messages": [
    {"role": "system", "content": "You are a code reviewer..."},
    {"role": "user", "content": "Review the PR diff"}
  ],
  "jetty": {
    "runbook": true,
    "collection": "my-org",
    "task": "pr-review",
    "agent": "claude-code",
    "file_paths": ["uploads/diff.patch"]
  }
}
```

What happens when that request hits Jetty:
- Sandbox provisioned. An isolated container from a pre-built snapshot: Python toolchain, browser automation, or custom.
- Agent installed. Claude Code, Codex, or Gemini CLI, depending on the `agent` field.
- Files uploaded. Whatever you referenced in `file_paths` is mounted into the workspace.
- Runbook injected. Your system prompt becomes the agent's mission: instructions, standards, verification checks.
- Agent executes freely. Full shell, Python, network, file I/O. Installs packages. Runs scripts.
- Artifacts collected. Everything in `/app/results/` persists to object storage.
- Trajectory recorded. Full step-by-step history: inputs, outputs, tokens, cost, timing.
- Response returned. Structured result with file URLs. Streaming via SSE, or async with a webhook callback.
The sandbox is destroyed. The trajectory isn't. You can replay, compare across models, label runs for evaluation, and wire the trajectory into your observability stack.
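Because the endpoint is plain Chat Completions over HTTP, the example request above can be assembled with any client. A minimal sketch in Python's standard library; the `build_agent_request` helper is a hypothetical convenience, and only the fields shown in the JSON example are assumed:

```python
import json
import urllib.request

JETTY_URL = "https://flows-api.jetty.io/v1/chat/completions"  # endpoint from the curl example

def build_agent_request(model, messages, **jetty_opts):
    """Assemble a Chat Completions payload. Including a 'jetty' block
    switches the endpoint from passthrough mode to agent mode."""
    payload = {"model": model, "messages": messages}
    if jetty_opts:
        payload["jetty"] = jetty_opts
    return payload

payload = build_agent_request(
    "claude-sonnet-4-6",
    [{"role": "system", "content": "You are a code reviewer..."},
     {"role": "user", "content": "Review the PR diff"}],
    runbook=True,
    collection="my-org",
    task="pr-review",
    agent="claude-code",
    file_paths=["uploads/diff.patch"],
)

# Build (but don't send) the POST; auth is a bearer token as in the curl example.
req = urllib.request.Request(
    JETTY_URL,
    data=json.dumps(payload).encode(),
    headers={"Authorization": "Bearer YOUR_JETTY_TOKEN",
             "Content-Type": "application/json"},
)
# urllib.request.urlopen(req) would send it; streaming via SSE or a webhook
# callback is chosen server-side as described above.
```

Without the `jetty` keyword arguments, the same helper emits a plain passthrough request.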
What you'd use it for
CI for AI
Trigger a workflow from GitHub Actions. The agent diffs the PR for bugs, security, and style. An LLM-as-judge scores the output. If the score drops below your threshold, the merge is blocked and the finding is posted as a PR comment.
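The gate itself reduces to one small decision step in your CI script. A hedged sketch; the `judge_result` shape, the threshold, and the `gate_pr` name are illustrative assumptions, not Jetty API:

```python
def gate_pr(judge_result, threshold=0.8):
    """Turn an LLM-as-judge score into a merge decision plus a PR comment.
    Returns (passed, comment); the judge_result schema here is hypothetical."""
    score = judge_result["score"]
    if score >= threshold:
        return True, f"Agent review passed ({score:.2f} >= {threshold})."
    findings = "\n".join(f"- {f}" for f in judge_result.get("findings", []))
    return False, f"Agent review failed ({score:.2f} < {threshold}):\n{findings}"

ok, comment = gate_pr({"score": 0.55,
                       "findings": ["SQL built by string concatenation in api.py"]})
# ok is False: post `comment` to the PR and exit non-zero to block the merge.
```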
Document processing pipelines
User uploads a PDF. The agent extracts structured data, validates it against a schema, fixes errors up to three times, and returns a validation report plus the extracted JSON. ~200 lines of app code.
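The fix-up-to-three-times loop is the core of those ~200 lines. A minimal sketch with stub steps; `extract`, `validate`, and `fix` are hypothetical stand-ins for agent calls through Jetty:

```python
MAX_FIXES = 3

def process_document(doc, extract, validate, fix):
    """Extract structured data, then validate and repair up to MAX_FIXES times.
    Returns the data plus a per-attempt validation report."""
    data = extract(doc)
    report = []
    for attempt in range(MAX_FIXES + 1):
        errors = validate(data)
        report.append({"attempt": attempt, "errors": errors})
        if not errors:
            return {"ok": True, "data": data, "report": report}
        if attempt < MAX_FIXES:
            data = fix(data, errors)
    return {"ok": False, "data": data, "report": report}

# Stub steps standing in for agent calls:
def extract(doc):
    return {"invoice_total": "1,280.00"}        # raw string from extraction

def validate(data):
    ok = isinstance(data["invoice_total"], float)
    return [] if ok else ["invoice_total must be a number"]

def fix(data, errors):
    return {"invoice_total": float(data["invoice_total"].replace(",", ""))}

result = process_document("uploads/invoice.pdf", extract, validate, fix)
```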
Evaluation harnesses
Run the same runbook across Claude Sonnet, GPT-4o, and Gemini 2.5. Every trajectory captures inputs, outputs, tokens, cost. Compare them side by side. This is how you pick a model — on your real work, not a benchmark.
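Since every trajectory records tokens and cost, the comparison is a fold over run records. A sketch assuming a simplified trajectory shape; the field names mirror the capture described above but the exact schema is an assumption:

```python
def compare_models(trajectories):
    """Aggregate pass rate and total cost per model from trajectory records."""
    stats = {}
    for t in trajectories:
        s = stats.setdefault(t["model"], {"runs": 0, "passed": 0, "cost_usd": 0.0})
        s["runs"] += 1
        s["passed"] += t["passed"]          # bool counts as 0/1
        s["cost_usd"] += t["cost_usd"]
    for s in stats.values():
        s["pass_rate"] = s["passed"] / s["runs"]
    return stats

# Hypothetical runs of the same runbook across three models:
runs = [
    {"model": "claude-sonnet-4-6", "passed": True,  "cost_usd": 0.12},
    {"model": "gpt-4o",            "passed": True,  "cost_usd": 0.09},
    {"model": "gpt-4o",            "passed": False, "cost_usd": 0.08},
    {"model": "gemini-2.5-pro",    "passed": True,  "cost_usd": 0.05},
]
stats = compare_models(runs)
```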
Multi-step workflow DAGs
47+ step types. Path expressions (`step_a.outputs.files[0].path`) wire the output of one step into the input of the next. Built-in branching, loops, iteration over collections, and LLM-as-judge evaluation steps.
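A path expression is just dotted traversal with optional indexing. This toy resolver is not Jetty's implementation, only a sketch to make the semantics concrete:

```python
import re

def resolve(path, state):
    """Walk a dotted path like 'step_a.outputs.files[0].path' over a dict
    of step outputs, with optional [i] indexing on each segment."""
    value = state
    for part in path.split("."):
        m = re.fullmatch(r"(\w+)(?:\[(\d+)\])?", part)
        key, idx = m.group(1), m.group(2)
        value = value[key]
        if idx is not None:
            value = value[int(idx)]
    return value

state = {"step_a": {"outputs": {"files": [{"path": "/app/results/report.json"}]}}}
resolve("step_a.outputs.files[0].path", state)  # -> "/app/results/report.json"
```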
Why not LangGraph or Temporal
Both are good for what they are. LangGraph gives you structural graph semantics. Temporal gives you durable execution primitives. For agent work that's genuinely graph-shaped — with branching, parallel workers, audit-grade state — they're the right tool.
For work that's mostly judgment-shaped — SQL reviews, content checks, document extraction, evaluation pipelines — they're heavier than the job justifies. A runbook is lighter, portable, diffable in a PR, and good enough. Jetty doesn't compete with those frameworks; it covers the work where their weight is overkill.
If you've already got Temporal for your core pipeline, Jetty can be a step inside a Temporal workflow. The two compose.
What you get out of the box
- OpenAI-compatible API. Drop-in for `/v1/chat/completions`. Existing SDKs work. The `jetty` extension is additive.
- 100+ model providers. OpenAI, Anthropic, Google, Mistral, Cohere, Groq. Switch models with a field change.
- Agent-agnostic sandbox. Claude Code, Codex, Gemini CLI today. Any CLI that takes an instruction and produces files tomorrow.
- Durable execution. Temporal-backed. Workflows survive crashes. Retries, timeouts, and webhooks built in.
- Full trajectory capture. Inputs, outputs, intermediate files, step timings, agent logs, tokens, cost.
- Structured evals. LLM-as-judge steps. Rubric scoring. Hill-climbing loops with a cap. Quality gates that plug into CI.
- Markdown as orchestration. The runbook is the spec. The markdown is the source of truth. No YAML configs, no graph DSLs.
Read more in the docs and in Jon's essay Runbooks for Agents.
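The hill-climbing loop with a cap, mentioned above, is a generate-score-retry cycle. A hedged sketch with stub `generate` and `judge` callables; the threshold and round cap are illustrative values, not Jetty defaults:

```python
def hill_climb(generate, judge, threshold=0.9, max_rounds=3):
    """Regenerate until the judge's rubric score clears the threshold or the
    round cap is hit. Returns (best_output, best_score, rounds_used)."""
    best_out, best_score = None, float("-inf")
    feedback = None
    for rounds in range(1, max_rounds + 1):
        out = generate(feedback)
        score, feedback = judge(out)
        if score > best_score:
            best_out, best_score = out, score
        if best_score >= threshold:
            break
    return best_out, best_score, rounds

# Stubs standing in for an agent step and an LLM-as-judge step:
scores = iter([0.5, 0.95])
def judge(out):
    return next(scores), "tighten the summary"
def generate(feedback):
    return f"draft ({feedback or 'initial'})"

out, score, rounds = hill_climb(generate, judge)
```

The cap is what turns an open-ended improvement loop into a bounded, budgetable CI step.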
The sharp edges, called honestly
- Tool support is what it is. Claude Code, Codex, and Gemini CLI are first-class. New agents get added as they emerge. If you're running something exotic, ask us.
- Sandbox snapshots are opinionated. We ship Python, browser automation, and a general-purpose image. Custom snapshots are supported, but not a one-click experience yet.
- The web UI is thin. Jetty is primarily an API. The Spot UI is for browsing runs and trajectories, not building runbooks graphically — and we think that's the right call. Runbooks belong in your editor, in your repo.
Install and ship
```bash
claude plugin marketplace add jettyio/jettyio-skills
claude plugin install jetty@jetty
/jetty-setup
```

Account, token, and a demo workflow in 3–5 minutes. First runbook in another 10.
If you'd rather wire it in from scratch, start with the API in the docs.
Related reading
- What is a runbook? — The spec format.
- The folder is the agent — The thesis.
- Model- and agent-agnostic — Why your runbooks should outlive any one vendor.
- Jetty vs LangGraph and Managed Agents — When each tool is the right call.