Canary

QA a described user flow against a live web app — the agent drives a real browser and hands back a report plus a replay script.

Fork and run

Create an account to run on Jetty. Free for your first 10 runs.

Run time3–5 mins

Version1.0.0

Agent + Model

claude-codeclaude-sonnet-4-6

Origingithub.com/wizenheimer/canary

Example runs

Example 1

TodoMVC — add & complete a todo

Add two todos and complete one. 7/7 steps pass, with real assertions on the counter (2 → 1) and the completed state. Self-contained report + replay.py.

Inputs

Target URLhttps://demo.playwright.dev/todomvc

Acceptance checklist

4/4 checks passed.

Setup
Drive
Replay
Report

TodoMVC — add & complete a todo — output preview

Published outputs

report.html replay.py steps.json console.log

System outputs

validation_report.json summary.md

Runbook

version	1.0.0
evaluation	programmatic
agent	claude-code
model	claude-sonnet-4-6
model_provider	anthropic
snapshot	prism-playwright
primary_outputs	report.html, replay.py

Canary — Agent-Driven Browser QA — Agent Runbook

Converted, with attribution, from Canary (github.com/wizenheimer/canary, MIT) — a QA harness for coding agents. Canary itself ships a CLI + a QuickJS-WASM Playwright sandbox + a daemon; this runbook re-implements its core idea with plain Playwright so it runs in the Jetty sandbox: describe a flow, the agent drives a real browser, and you get back a self-contained report and a reusable replay script.

EXECUTE THIS RUNBOOK NOW. Drive the browser with tools and write every deliverable to {{results_dir}}. This is a task to perform, not a document to summarize. Your first action is a tool call (Step 1).

Inputs (already provided)

Target URL: {{target_url}} — where the flow starts.
Flow: {{flow}} — the user journey to QA, in plain language, with the checks that must hold (visible text / URL / element state / no console error).
Credentials (optional): {{credentials}} — e.g. user=...,pass=... for a login step.

Objective

QA a described user flow against a live web app the way Canary does: the agent drives a real (headless) browser through small, intent-named steps, and captures evidence at every step — a screenshot, console messages, and network activity — plus a Playwright trace for the whole session. Each step that encodes a check is an assertion (visible text, URL, element state, no console error). The run produces two things Canary insists on having together: a report you can just read (report.html, self-contained) and the exact reusable script (replay.py) that reproduces the flow in CI with zero agent cost. Don't make the user choose between an opaque agent run and hand-written Playwright — hand back both.

REQUIRED OUTPUT FILES (MANDATORY)

You MUST write all of the following to {{results_dir}}. The task is NOT complete until every file exists and is non-empty. No exceptions.

File	Description
`{{results_dir}}/report.html`	Self-contained QA report: per-step status, the inline screenshot of each step, console errors, a network summary, and the overall verdict. Open it, commit it, send it.
`{{results_dir}}/replay.py`	The reusable Playwright script that reproduces the flow exactly — re-runnable in CI with no agent cost.
`{{results_dir}}/steps.json`	Structured per-step results: name, action, check, status, screenshot path, console errors.
`{{results_dir}}/trace.zip`	The Playwright trace for the whole session (open with `playwright show-trace`).
`{{results_dir}}/console.log`	All browser console messages captured during the run.
`{{results_dir}}/network.har`	The network HAR for the session.
`{{results_dir}}/summary.md`	Executive summary: flow, verdict, steps passed/failed, the single most important finding.
`{{results_dir}}/validation_report.json`	Stage-by-stage validation with `overall_passed`. See Step 5.

Screenshots go in {{results_dir}}/screenshots/. If you finish but have not written every file, go back and write it.

Parameters

Parameter	Template Variable	Default	Description
Results directory	`{{results_dir}}`	`/app/results` (Jetty) / `./results` (local)	Output directory
Target URL	`{{target_url}}`	(required)	Where the flow starts
Flow	`{{flow}}`	(required)	The plain-language flow + the checks that must hold
Credentials	`{{credentials}}`	(optional)	Login creds if the flow needs them
Headless	`{{headless}}`	`true`	Run the browser headless (always true on Jetty)

Dependencies

Dependency	Type	Required	Description
`playwright` (Python) + Chromium	Runtime	Yes	Pre-installed on the `prism-playwright` snapshot

Step 1: Environment Setup

mkdir -p "{{results_dir}}/screenshots"
python -c "import playwright; print('playwright', playwright.__version__)" || python -m pip install --quiet playwright
python -m playwright install chromium 2>/dev/null || true
SITE="{{target_url}}"
[ -n "$SITE" ] && [ "$SITE" != "{{target_url}}" ] || { echo "ERROR: no target_url provided"; exit 1; }
echo "QA target: $SITE"

Step 2: Explore, then Drive the Flow

First observe the target: fetch the page, note the real selectors for the elements the flow touches (don't guess — read the DOM). Then translate the plain-language {{flow}} into small, intent-named steps and drive them with the harness below, capturing evidence per step. Each step is either an action (navigate, click, fill) or an assertion (a check from the flow). Re-read selectors from the live page if one fails (max 3 retries per step), the way Canary's explore-and-record loop does.

# Canary-style harness: one screenshot + console + per-step status, plus a full trace.
import json, pathlib, sys
from playwright.sync_api import sync_playwright

RESULTS = "{{results_dir}}"
TARGET  = "{{target_url}}"
pathlib.Path(f"{RESULTS}/screenshots").mkdir(parents=True, exist_ok=True)

console_msgs, steps = [], []

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context(record_har_path=f"{RESULTS}/network.har", viewport={"width": 1280, "height": 800})
    context.tracing.start(screenshots=True, snapshots=True, sources=True)
    page = context.new_page()
    page.on("console", lambda m: console_msgs.append({"type": m.type, "text": m.text}))
    page.on("pageerror", lambda e: console_msgs.append({"type": "pageerror", "text": str(e)}))

    def step(name, action, fn, check=None):
        """Run one intent-named step; screenshot + record status. check() -> bool|None."""
        rec = {"name": name, "action": action, "status": "pass", "console_errors": 0, "error": None}
        try:
            fn(page)
            page.wait_for_timeout(400)
            if check is not None:
                rec["status"] = "pass" if check(page) else "fail"
        except Exception as e:
            rec["status"] = "fail"; rec["error"] = str(e)[:300]
        shot = f"{RESULTS}/screenshots/{len(steps)+1:02d}-{name.replace(' ','_')[:40]}.png"
        try: page.screenshot(path=shot, full_page=False)
        except Exception: shot = None
        rec["screenshot"] = (shot.split("/")[-1] if shot else None)
        rec["console_errors"] = sum(1 for m in console_msgs if m["type"] in ("error", "pageerror"))
        steps.append(rec)
        print(f"  [{rec['status'].upper()}] {name}")
        return rec

    # ---- EXAMPLE shape — REPLACE these with the steps for {{flow}} ----
    step("open the app", "navigate", lambda pg: pg.goto(TARGET, wait_until="domcontentloaded"))
    # step("submit the form", "click", lambda pg: pg.click("button[type=submit]"))
    # step("result is visible", "assert", lambda pg: None,
    #      check=lambda pg: pg.get_by_text("Success").is_visible())
    # ------------------------------------------------------------------

    context.tracing.stop(path=f"{RESULTS}/trace.zip")
    context.close(); browser.close()

pathlib.Path(f"{RESULTS}/console.log").write_text("\n".join(f"[{m['type']}] {m['text']}" for m in console_msgs))
pathlib.Path(f"{RESULTS}/steps.json").write_text(json.dumps(steps, indent=2))
passed = sum(1 for s in steps if s["status"] == "pass")
print(f"verdict: {passed}/{len(steps)} steps passed")

Drive real assertions from the flow's checks (visible text, URL, element state). A step with no check is an action; a step that encodes a check is an assertion and must actually verify the page, not just not-throw.

Step 3: Write the Reusable Replay Script

Write {{results_dir}}/replay.py — a standalone Playwright script (no agent, no harness) that reproduces the exact flow and exits non-zero if any assertion fails. This is the artifact that runs in CI with zero inference cost.

# replay.py — generated; reproduces the QA flow headless and asserts each check.
from playwright.sync_api import sync_playwright, expect

def main():
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_context().new_page()
        page.goto("{{target_url}}")
        # ... the exact steps + expect(...) assertions from Step 2 ...
        browser.close()

if __name__ == "__main__":
    main()

It must mirror Step 2's steps one-to-one. Smoke-test it: python {{results_dir}}/replay.py should exit 0 on a passing flow.

Step 4: Build `report.html`

Write a self-contained {{results_dir}}/report.html (no external assets — inline the screenshots as base64). Include: the flow + target URL, the overall verdict (pass/fail), and for each step its name, action, status, inline screenshot, and console-error count; plus a network summary (total requests, failures) and a footer noting trace.zip / replay.py.

import base64, json, pathlib
RESULTS = "{{results_dir}}"
steps = json.load(open(f"{RESULTS}/steps.json"))
def img(name):
    p = pathlib.Path(f"{RESULTS}/screenshots/{name}")
    if not (name and p.exists()): return ""
    b = base64.b64encode(p.read_bytes()).decode()
    return f'<img src="data:image/png;base64,{b}" style="max-width:560px;border:1px solid #ddd;border-radius:6px"/>'
passed = sum(1 for s in steps if s["status"]=="pass"); total=len(steps)
verdict = "PASS" if passed==total and total>0 else "FAIL"
rows = "".join(
  f'<div style="margin:18px 0;padding:14px;border-left:4px solid {"#22c55e" if s["status"]=="pass" else "#ef4444"};background:#fafafa">'
  f'<b>{i+1}. {s["name"]}</b> <span style="color:#666">({s["action"]})</span> '
  f'<span style="float:right;color:{"#16a34a" if s["status"]=="pass" else "#dc2626"}">{s["status"].upper()}</span>'
  f'<div style="color:#999;font-size:13px">console errors: {s.get("console_errors",0)}{" · "+s["error"] if s.get("error") else ""}</div>'
  f'<div style="margin-top:8px">{img(s.get("screenshot"))}</div></div>'
  for i,s in enumerate(steps))
html = f"""<!doctype html><meta charset=utf-8><title>Canary QA Report</title>
<body style="font-family:system-ui;max-width:760px;margin:40px auto;color:#111">
<h1>Canary QA Report</h1>
<p><b>Target:</b> {{target_url}}<br><b>Verdict:</b>
<span style="font-weight:700;color:{'#16a34a' if verdict=='PASS' else '#dc2626'}">{verdict}</span>
&nbsp;({passed}/{total} steps)</p>{rows}
<hr><p style="color:#888;font-size:13px">Evidence: trace.zip (playwright show-trace) · network.har · replay.py · console.log</p>
</body>"""
pathlib.Path(f"{RESULTS}/report.html").write_text(html)
print("wrote report.html", verdict)

Step 5: Evaluate, Validate & Iterate (max 3 rounds)

Status	Criteria
`PASS`	The flow drove ≥ 2 steps, evidence was captured for each (screenshot + console + the shared trace.zip + network.har), every assertion step ran a real check, `report.html` and `replay.py` both exist and are non-empty, and `replay.py` smoke-runs without import/syntax errors.
`PARTIAL`	The flow ran but a non-blocking step failed (e.g. one optional assertion), or `replay.py` reproduces only part of the flow. Report which step and why.
`FAIL`	The browser couldn't drive the flow at all (target unreachable, every step errored), or `report.html`/`replay.py` is missing.

If a step failed on a brittle selector, re-read the live DOM and fix the selector (max 3 rounds), then re-run. Write validation_report.json:

{
  "version": "1.0.0", "run_date": "<ISO>",
  "parameters": { "target_url": "{{target_url}}" },
  "stages": [
    { "name": "setup",   "passed": true, "message": "playwright + chromium ready" },
    { "name": "drive",   "passed": true, "message": "N steps driven, evidence captured" },
    { "name": "replay",  "passed": true, "message": "replay.py written and smoke-runs" },
    { "name": "report",  "passed": true, "message": "report.html + all artifacts written" }
  ],
  "results": { "steps_total": 0, "steps_passed": 0, "verdict": "PASS|FAIL" },
  "overall_passed": true
}

overall_passed is true iff every stage passed and report.html + replay.py exist.

Step 6: Write Executive Summary

Write {{results_dir}}/summary.md:

# Canary QA — Results

## Overview
- **Date**: <ISO>  ·  **Target**: {{target_url}}
- **Flow**: <one-line restatement>
- **Verdict**: PASS|FAIL  ·  **Steps**: <passed>/<total>

## Steps
| # | Step | Action | Status | Console errors |
|---|------|--------|--------|----------------|

## Most important finding
<one sentence: the failing assertion, or "flow passed clean — N steps, 0 console errors">

## Artifacts
- report.html (self-contained) · replay.py (CI-ready) · trace.zip · network.har · console.log

Final Checklist (MANDATORY — do not skip)

Verification Script

echo "=== FINAL OUTPUT VERIFICATION ==="
RESULTS_DIR="{{results_dir}}"
for f in "$RESULTS_DIR/report.html" "$RESULTS_DIR/replay.py" "$RESULTS_DIR/steps.json" \
         "$RESULTS_DIR/trace.zip" "$RESULTS_DIR/console.log" "$RESULTS_DIR/network.har" \
         "$RESULTS_DIR/summary.md" "$RESULTS_DIR/validation_report.json"; do
  [ -s "$f" ] && echo "PASS: $f ($(wc -c < "$f") bytes)" || echo "FAIL: $f is missing or empty"
done
SHOTS=$(ls "$RESULTS_DIR"/screenshots/*.png 2>/dev/null | wc -l | tr -d ' ')
[ "$SHOTS" -ge 2 ] && echo "PASS: $SHOTS step screenshots" || echo "FAIL: too few screenshots ($SHOTS)"
python3 -c "import ast; ast.parse(open('$RESULTS_DIR/replay.py').read()); print('PASS: replay.py parses')" || echo "FAIL: replay.py has a syntax error"
echo "=== VERIFICATION COMPLETE ==="

Checklist

The flow drove ≥ 2 intent-named steps against the live target
Every step has a screenshot; console + network + the shared trace.zip were captured
Each assertion step ran a real check (visible text / URL / state), not just no-throw
report.html is self-contained (screenshots inlined, opens with no external assets)
replay.py reproduces the flow and parses/smoke-runs cleanly
summary.md states the verdict and the single most important finding
validation_report.json has stages, results, overall_passed

If ANY item fails, go back and fix it. Do NOT finish until all items pass.

Tips

Observe before you drive. Read the live DOM for real selectors; don't guess. Canary's edge is exploring the actual page, not replaying a brittle pre-written script.
An assertion must assert. A step that "checks" something has to verify the page (text visible, URL changed, element enabled) — a step that merely doesn't throw is an action, not a check. Console errors are a check too: a clean flow has zero.
Hand back both. The whole point is a readable report.html AND the exact replay.py. The report is for a human; the script runs in CI with zero agent cost on replay.
Small, intent-named steps. "log in", "add to cart", "cart shows 1" — not "click #btn-3". Intent names make the report and the trace readable.
Headless, deterministic targets. Public test apps (TodoMVC, the-internet, saucedemo) are ideal: stable selectors, no auth walls, designed to be driven.