Troubleshooting & gotchas

Most failed runs come from a small set of recurring causes. Each one below has a clear symptom and a clear fix. If you're staring at a run that did nothing, start at the top.

The agent exits in seconds and writes no files

Cause: the model slug doesn't match the provider. A bare slug like claude-sonnet-4-6 sent through OpenRouter resolves to nothing, and the agent quits immediately.

Fix: set model_provider and use the provider-qualified slug, so OpenRouter wants anthropic/claude-sonnet-4.6. Never leave init_params.model empty; an empty model falls back to a bad default and the run no-ops in ~10 seconds.

The agent replies in chat instead of doing the work

Cause: a prose-style system prompt. Some runtimes will answer conversationally and write zero files unless the instructions imperatively demand tool use. Very large prompts (tens of thousands of characters) make this worse.

Fix: make the objective action-oriented: “Read X. Write Y to results/.” Restate required inputs imperatively in the body rather than leaving them in metadata.

A synchronous run returns a 524 but seems to keep going

Cause: sync runs that take longer than ~100 seconds hit a Cloudflare 524 timeout at the edge. The run is still executing server-side, even though the HTTP call gave up.

Fix: don't retry the request (that starts a second run). Poll the trajectory list until it completes, or use async runs with a webhook. For long research runbooks, raise timeout_sec and write output incrementally.

A run sits in “pending” forever

Cause: usually the workflow has no run step. An empty steps array completes instantly as a no-op (0 steps); a runbook task needs a real step (for example steps: ["run"]). Under heavy load, runs can also queue behind a burst of long-running agent activities.

Fix: confirm the task's steps array is non-empty and points at the runbook step. If a run is genuinely orphaned, cancel it by age rather than re-running, since a re-run can create duplicate work.

401 Unauthorized

Cause: a missing or stale token.

Fix: check JETTY_API_TOKEN in your environment, then ~/.config/jetty/token, then Settings → API keys. If a fresh token still 401s, stop and check the collection and base URL before assuming the token is the problem.

A step errors on a parameter it should accept

Cause: activity config shape mismatches. A few to know:

litellm_chat takes prompt or prompt_path, not a bare string somewhere else.
litellm_batch wants requests / requests_path and returns .results (each request is a dict like { messages: [...] }), even though older docs say prompts / .responses.
File uploads land at init_params.file_paths[N]; wire a step's input to that path, not to an invented init_params.image_path.

Fix: check the activity's real config in the step library, or fetch it with get-step-template.

Polling a trajectory always returns no status

Cause: using the wrong id. The trajectory id is the short hex suffix after the -- in the workflow id, not the full workflow id. The wrong id silently returns a null status forever.

Fix: poll with the suffix, or list trajectories for the task and read the id off the result.

A run flakes once, then works on re-run

Cause: transient per-run conditions: a sandbox that briefly lacked a forwarded credential (a live fetch 403s), or a runtime that occasionally returns an empty completion on the first turn.

Fix: re-run once. If it's reproducible, check connected accounts & secrets and confirm the provider key is present on the collection.

Still stuck? Open the trajectory and read the failing step's inputs and outputs, where the trace almost always shows the cause. Background on what you're looking at: Trajectories & evaluation.