Cross-model benchmark

Run the same prompt through two or more models side by side and score the results.

Fork and run
Create an account to run on Jetty. Free for your first 10 runs.
Run time2–4 mins
Version1.0.0
Agent + Model
claude-codeclaude-sonnet-4-6

Example runs

Example 1

Coding prompt — merge intervals

Three models implement merge_intervals(). gpt-4o fastest + cheapest; claude and gpt tie at quality 10; gemini-2.5-flash verbose and slower.

Inputs

Modelsclaude-sonnet-4-6,gpt-4o,gemini/gemini-2.5-flash
Judgetrue

Acceptance checklist

5/5 checks passed.
  • Setup
  • Preflight
  • Run
  • Judge
  • Report

Runbook

version1.0.0
evaluationprogrammatic
agentclaude-code
modelclaude-sonnet-4-6
model_provideranthropic
snapshotpython312-uv
primary_outputsbenchmark.md

Cross-Model Benchmark — Agent Runbook

Converted, with attribution, from Garry Tan's benchmark-models skill (github.com/garrytan/gstack, MIT). The original wraps the gstack-model-benchmark binary + the Claude/Codex/Gemini CLIs; this runbook re-implements the same idea as a self-contained, provider-agnostic benchmark via litellm, so it runs anywhere.

EXECUTE THIS RUNBOOK NOW. Run the benchmark with tools and write every deliverable to {{results_dir}}. This is a task to perform, not a document to summarize. Your first action is a tool call (Step 1).

Inputs (already provided)

  • Prompt: the first *.txt / *.md uploaded into /app/assets/, or the {{prompt}} parameter if no file is present.
  • Models: {{models}} — comma-separated litellm model ids (e.g. claude-sonnet-4-6,gpt-4o,gemini/gemini-2.0-flash).
  • System prompt (optional): {{system_prompt}}
  • Judge: {{judge}} (true = score each output 0–10 with an LLM judge).

Objective

Run the same prompt through two or more models side by side and answer "which model is actually best for this task?" with data instead of vibes. For each model, measure latency, tokens (prompt + completion), and cost, capture the output, and — when the judge is on — score output quality 0–10. Then recommend the fastest, cheapest, highest-quality, and best-overall model, surfacing the tradeoff the user has to make. Models that error (bad key, quota, timeout) are skipped cleanly and reported, never aborting the batch — a benchmark with one provider down is still useful.

This is the provider-agnostic engine behind "model shootouts": pick the right model for a prompt, a skill, or a workload, and catch quality/cost regressions when providers ship new versions.


REQUIRED OUTPUT FILES (MANDATORY)

You MUST write all of the following to {{results_dir}}. The task is NOT complete until every file exists and is non-empty. No exceptions.

FileDescription
{{results_dir}}/benchmark.mdThe comparison report: a per-model table (latency / tokens / cost / quality) and the fastest / cheapest / highest-quality / best-overall recommendation. The headline deliverable.
{{results_dir}}/results.jsonFull structured results — one object per model with metrics, the raw output, judge score/reason, and any error.
{{results_dir}}/summary.mdExecutive summary: models run vs skipped, the recommendation, and the cost of the benchmark itself.
{{results_dir}}/validation_report.jsonStage-by-stage validation with overall_passed. See Step 6.

If you finish but have not written all files, go back and write them first.


Parameters

ParameterTemplate VariableDefaultDescription
Results directory{{results_dir}}/app/results (Jetty) / ./results (local)Output directory
Prompt{{prompt}}(empty → use the uploaded file)Inline prompt text (used only if no file is uploaded)
Models{{models}}claude-sonnet-4-6,gpt-4o,gemini/gemini-2.0-flashComma-separated litellm model ids
System prompt{{system_prompt}}(empty)Optional system prompt sent to every model
Judge{{judge}}truetrue to score each output 0–10 with an LLM judge
Judge model{{judge_model}}claude-sonnet-4-6The model used as the quality judge

Dependencies

DependencyTypeRequiredDescription
litellmPython packageYesOne API across Anthropic / OpenAI / Gemini / OpenRouter; gives token usage + cost
At least one provider keyCredentialYesANTHROPIC_API_KEY / OPENAI_API_KEY / GEMINI_API_KEY / OPENROUTER_API_KEY (Jetty trial keys cover the first three on the platform)

Step 1: Environment Setup & Provider Preflight

python -m pip install --quiet "litellm>=1.40" mkdir -p "{{results_dir}}" echo "Provider keys present:" for k in ANTHROPIC_API_KEY OPENAI_API_KEY GEMINI_API_KEY OPENROUTER_API_KEY; do [ -n "${!k}" ] && echo " $k: SET" || echo " $k: (absent)" done

Preflight (the dry-run, from the source skill): map each requested model to its provider and check the key is present. Models whose provider key is absent are reported as skipped: no_key and excluded — they do not abort the run. If zero models have a key, STOP and write a clear error (a benchmark needs at least one authed provider).

Provider inference: claude*/anthropic/*ANTHROPIC_API_KEY; gpt*/o1*/openai/*OPENAI_API_KEY; gemini*GEMINI_API_KEY; openrouter/*OPENROUTER_API_KEY.


Step 2: Resolve the Prompt

Use the first *.txt/*.md in /app/assets/; if none, use {{prompt}}. If both are empty, STOP with an error. Record the resolved prompt source in summary.md.


Step 3: Run the Benchmark

Run the same prompt through every authed model, measuring latency, tokens, and cost. Errors are caught per-model and recorded, never raised.

import litellm, time, json, pathlib, glob litellm.drop_params = True # tolerate provider-specific params litellm.suppress_debug_info = True MODELS = [m.strip() for m in "{{models}}".split(",") if m.strip()] SYSTEM = """{{system_prompt}}""".strip() RESULTS_DIR = "{{results_dir}}" # Resolve the prompt cands = sorted(glob.glob("/app/assets/*.txt") + glob.glob("/app/assets/*.md")) PROMPT = pathlib.Path(cands[0]).read_text() if cands else """{{prompt}}""" PROMPT = PROMPT.strip() assert PROMPT, "no prompt provided" def msgs(): m = [] if SYSTEM and SYSTEM != "{{" + "system_prompt}}": m.append({"role": "system", "content": SYSTEM}) m.append({"role": "user", "content": PROMPT}) return m results = [] for model in MODELS: rec = {"model": model} try: t0 = time.perf_counter() resp = litellm.completion(model=model, messages=msgs(), timeout=180) dt = time.perf_counter() - t0 out = resp.choices[0].message.content or "" u = resp.usage try: cost = litellm.completion_cost(completion_response=resp) except Exception: cost = None rec.update({"ok": True, "latency_s": round(dt, 2), "prompt_tokens": getattr(u, "prompt_tokens", None), "completion_tokens": getattr(u, "completion_tokens", None), "total_tokens": getattr(u, "total_tokens", None), "cost_usd": (round(cost, 6) if cost is not None else None), "output": out}) except Exception as e: rec.update({"ok": False, "error": str(e)[:300]}) results.append(rec) print(model, "->", "OK" if rec.get("ok") else "ERR")

Step 4: Judge Quality (when {{judge}} is true)

For each successful output, ask the judge model to score it 0–10 on correctness, completeness, and clarity, returning strict JSON. Skip judging for models that errored.

JUDGE = "{{judge}}".strip().lower() in ("true", "1", "yes") JUDGE_MODEL = "{{judge_model}}".strip() or "claude-sonnet-4-6" if JUDGE: for r in results: if not r.get("ok"): continue jp = (f"Score the ANSWER below as a response to the PROMPT, 0-10, on correctness, " f"completeness, and clarity combined.\n\nPROMPT:\n{PROMPT}\n\nANSWER:\n{r['output']}\n\n" f'Return ONLY JSON: {{"score": <0-10 number>, "reason": "<one sentence>"}}') try: jr = litellm.completion(model=JUDGE_MODEL, messages=[{"role": "user", "content": jp}], timeout=120) jt = jr.choices[0].message.content obj = json.loads(jt[jt.find("{"): jt.rfind("}") + 1]) r["judge_score"] = obj.get("score") r["judge_reason"] = obj.get("reason") except Exception as e: r["judge_score"] = None r["judge_error"] = str(e)[:200]

Step 5: Recommend & Write Outputs

ok = [r for r in results if r.get("ok")] def pick(seq, key, best=min): seq = [r for r in seq if r.get(key) is not None] return best(seq, key=lambda r: r[key]) if seq else None fastest = pick(ok, "latency_s", min) cheapest = pick(ok, "cost_usd", min) top_qual = pick(ok, "judge_score", max) bench_cost = round(sum((r.get("cost_usd") or 0) for r in results), 6) pathlib.Path(f"{RESULTS_DIR}/results.json").write_text(json.dumps( {"prompt_chars": len(PROMPT), "models": results, "recommendation": {"fastest": fastest and fastest["model"], "cheapest": cheapest and cheapest["model"], "highest_quality": top_qual and top_qual["model"]}}, indent=2, ensure_ascii=False)) # benchmark.md — the table + recommendation def row(r): if not r.get("ok"): return f"| {r['model']} | — | — | — | — | skipped: {r.get('error','')[:40]} |" q = r.get("judge_score"); c = r.get("cost_usd") return (f"| {r['model']} | {r['latency_s']}s | {r.get('total_tokens','?')} | " f"{('$%.5f'%c) if c is not None else 'n/a'} | {q if q is not None else '—'} | ok |") md = ["# Cross-Model Benchmark\n", f"Prompt: {len(PROMPT)} chars · Models: {len(results)} ({len(ok)} ran) · Judge: {JUDGE}\n", "| Model | Latency | Tokens | Cost | Quality (0-10) | Status |", "|-------|---------|--------|------|----------------|--------|"] md += [row(r) for r in results] md.append("\n## Recommendation\n") md.append(f"- **Fastest:** {fastest['model'] if fastest else 'n/a'}") md.append(f"- **Cheapest:** {cheapest['model'] if cheapest else 'n/a'}") if JUDGE: md.append(f"- **Highest quality:** {top_qual['model'] if top_qual else 'n/a'}") md.append(f"- **Benchmark cost:** ${bench_cost}") pathlib.Path(f"{RESULTS_DIR}/benchmark.md").write_text("\n".join(md)) print("wrote benchmark.md + results.json")

The "best overall" is a judgment call, not a formula — state the tradeoff in summary.md (e.g. "Gemini was 3× cheaper and nearly as fast, but Claude scored 2 points higher on quality; pick by whether this workload is quality- or cost-bound"). Cross-model agreement is a recommendation; the user decides.


Step 6: Evaluate & Validate

StatusCriteria
PASSAt least 2 models ran successfully; each successful model has latency + tokens recorded; benchmark.md has the table and a recommendation; (judge on) each successful model has a judge_score or a recorded judge_error.
PARTIALExactly 1 model ran (others skipped/errored) — a single-model run has no cross-model signal; or cost was unavailable for every model (pricing unknown).
FAILZero models ran, results.json is invalid, or benchmark.md lacks the table/recommendation.

Write validation_report.json:

{ "version": "1.0.0", "run_date": "<ISO timestamp>", "parameters": { "models": "{{models}}", "judge": "{{judge}}" }, "stages": [ { "name": "setup", "passed": true, "message": "litellm installed; N provider keys present" }, { "name": "preflight","passed": true, "message": "M models authed, K skipped (no key)" }, { "name": "run", "passed": true, "message": "R models ran, E errored" }, { "name": "judge", "passed": true, "message": "Scored R outputs (or judge off)" }, { "name": "report", "passed": true, "message": "All output files written" } ], "results": { "models_requested": 0, "models_ran": 0, "models_errored": 0, "judge_enabled": true, "benchmark_cost_usd": 0 }, "overall_passed": true }

overall_passed is true iff models_ran >= 2 and every output file exists.


Step 7: Iterate (max 3 rounds)

If a model errored with a fixable cause, fix and retry only that model:

ErrorFix
AuthenticationError / missing keyThe provider key isn't set — skip that model and note it; don't retry without a key.
RateLimitError / quota (common on trial Gemini keys)Wait briefly and retry once; if it persists, mark skipped: quota and continue.
BadRequestError: model not foundThe litellm model id is wrong for the provider — correct it (e.g. gemini/gemini-2.0-flash, not gemini-2.0-flash).
TimeoutRaise the per-call timeout once; if it persists, record skipped: timeout.
completion_cost raised / cost_usd nulllitellm has no pricing for that model id — leave cost null and note it; don't fail the run.

Max 3 rounds; then keep what ran and surface the rest in summary.md.


Step 8: Write Executive Summary

Write {{results_dir}}/summary.md:

# Cross-Model Benchmark — Results ## Overview - **Date**: <ISO timestamp> · **Prompt source**: <file or inline> · **Judge**: {{judge}} - **Models requested**: <N> · **Ran**: <R> · **Skipped/errored**: <list with reasons> - **Benchmark cost**: $<...> ## Results | Model | Latency | Tokens | Cost | Quality | |-------|---------|--------|------|---------| ## Recommendation - **Fastest** / **Cheapest** / **Highest quality**: <...> - **Best overall**: <judgment + the tradeoff the user must make> ## Notes / Limitations <e.g. Gemini skipped (trial quota); cost unavailable for model X (no pricing).>

Final Checklist (MANDATORY — do not skip)

Verification Script

echo "=== FINAL OUTPUT VERIFICATION ===" RESULTS_DIR="{{results_dir}}" for f in "$RESULTS_DIR/benchmark.md" "$RESULTS_DIR/results.json" "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do [ -s "$f" ] && echo "PASS: $f ($(wc -c < "$f") bytes)" || echo "FAIL: $f is missing or empty" done python3 - <<PY import json d = json.load(open("$RESULTS_DIR/results.json")) ran = [m for m in d["models"] if m.get("ok")] print(f"PASS: {len(ran)} models ran" if len(ran) >= 2 else f"WARN: only {len(ran)} model ran (no cross-model signal)") for m in ran: assert m.get("latency_s") is not None and m.get("total_tokens") is not None, f"missing metrics: {m['model']}" print("PASS: all ran models have latency + tokens") PY echo "=== VERIFICATION COMPLETE ==="

Checklist

  • At least 2 models ran (or the run is honestly marked PARTIAL with the reason)
  • Each ran model has latency + tokens; cost where pricing is known
  • benchmark.md has the per-model table and a fastest/cheapest/quality recommendation
  • results.json is valid and includes each model's raw output + any error
  • summary.md names the best-overall tradeoff, not just a single winner
  • Skipped/errored models are reported with a reason (no silent drops)
  • validation_report.json has stages, results, overall_passed

If ANY item fails, go back and fix it. Do NOT finish until all items pass.


Tips

  • Never benchmark without the preflight. Check provider keys first (the source skill's dry-run). A model with no key is skipped cleanly, never an aborted batch.
  • One provider down is still a benchmark. Trial Gemini keys often have zero quota — record skipped: quota and compare the rest; don't fail the run.
  • Cost can be null. litellm only computes cost for model ids it has pricing for. A null cost is a known gap, not a failure — report it.
  • The judge is the point, but it costs. Quality is why you benchmark; the judge adds a little cost per model. Keep it on unless the user only cares about speed/cost.
  • Best-overall is a tradeoff, not a number. Name what the user is trading (e.g. "2× the cost for +1.5 quality points"). Cross-model agreement is a recommendation; the user decides.
  • Use the same prompt verbatim for every model — any difference invalidates the comparison. Send the same system_prompt to all, too.