Use this guide to measure whether an AI agent can install Lamina, authenticate, create assets, and retrieve outputs without custom recovery code.

What To Measure

Track every agent run against five stages:
| Stage | Success Signal | Failure Stage |
| --- | --- | --- |
| Install | Client discovers /mcp/agent and completes dynamic registration or manual install | install |
| Auth | OAuth authorization and token exchange succeed for the selected workspace | auth |
| Input | The agent supplies enough brief, brand, and asset context to start a run | input |
| Runtime | Lamina queues and completes the selected creative workflow | runtime |
| Output | The agent receives usable final assets or structured outputs | output |
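The five stages are ordered, so a run's failure stage can be derived as the first stage whose success signal was not observed. A minimal sketch (the function name and the per-stage boolean input are illustrative, not part of the Lamina API):

```python
# Ordered benchmark stages from the table above; a run's failure stage
# is the first stage whose success signal was not observed.
STAGES = ["install", "auth", "input", "runtime", "output"]

def classify_failure_stage(stage_results):
    """Return the first failed stage name, or None if all five passed.

    `stage_results` maps stage name -> bool; a missing stage counts
    as not yet succeeded.
    """
    for stage in STAGES:
        if not stage_results.get(stage, False):
            return stage
    return None
```

A fully successful run yields `None`; a run that installed but never authenticated classifies as `auth`.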
The hosted MCP runtime emits benchmark-oriented telemetry events when server telemetry is enabled with POSTHOG_SERVER_API_KEY.
| Event | When It Fires |
| --- | --- |
| agent_runtime.install.discovery_challenged | A remote MCP client discovers that /mcp/agent requires OAuth |
| agent_runtime.install.client_registered | Dynamic MCP OAuth registration succeeds |
| agent_runtime.install.failed | Dynamic registration fails |
| agent_runtime.auth.authorize_redirected | The OAuth authorize request validates and redirects to consent |
| agent_runtime.auth.approved | A signed-in user approves workspace access |
| agent_runtime.auth.token_issued | Authorization code or refresh-token exchange succeeds |
| agent_runtime.auth.succeeded | A bearer token is accepted for /mcp/agent |
| agent_runtime.auth.failed | OAuth authorization, token exchange, or bearer validation fails |
| agent_runtime.tool_call.completed | A call to one of the five MCP tools completes and is classified |
Every agent_runtime.tool_call.completed event includes:
```json
{
  "tool_name": "lamina_create",
  "success": true,
  "outcome": "success",
  "failure_stage": null,
  "failure_category": null,
  "duration_ms": 1420,
  "auth_mode": "oauth",
  "run_status": "completed",
  "needs_input": false,
  "output_count": 1
}
```
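Before aggregating these events, it can help to validate that each payload is internally consistent. A hedged sketch using the field names from the example payload above (the specific invariants checked are illustrative assumptions, not documented server guarantees):

```python
def check_tool_call_event(event):
    """Flag internal inconsistencies in an
    agent_runtime.tool_call.completed payload. Returns a list of
    problem descriptions; an empty list means the event looks sane."""
    problems = []
    if event.get("success"):
        # A successful call should not carry failure classification.
        if event.get("outcome") != "success":
            problems.append("success=true but outcome != 'success'")
        if event.get("failure_stage") is not None:
            problems.append("success=true but failure_stage is set")
    else:
        if event.get("failure_stage") is None:
            problems.append("success=false but failure_stage missing")
    # Durations should always be non-negative integers of milliseconds.
    if not isinstance(event.get("duration_ms"), int) or event["duration_ms"] < 0:
        problems.append("duration_ms must be a non-negative integer")
    return problems
```

Running this over an event stream before building dashboards catches malformed payloads early instead of skewing the rates downstream.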

Benchmark Scenarios

Run these scenarios for each supported MCP client before declaring the distribution ready.
| Scenario | Required Proof |
| --- | --- |
| Hosted OAuth install | The client discovers metadata, registers or uses its configured client, shows consent, receives tokens, and lists exactly five tools |
| First image run | The agent calls lamina_create from a short prompt and obtains at least one image output through lamina_status |
| First video run | The agent calls lamina_create for a video task and can wait or poll until a terminal result |
| Brand-aware planning | The agent calls lamina_brand, uses the returned guidance, and starts a run with the same workspace context |
| Batch creative | The agent calls lamina_batch with 3 to 10 related briefs and receives per-item runId values or actionable item errors |
| Clarification loop | An intentionally underspecified request returns needsInput with missing fields, examples, and a follow-up prompt |
| Auth recovery | An expired or insufficient-scope token yields a clear OAuth error and the client can reauthorize |
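Since every scenario must pass for every supported client, a readiness check reduces to set containment. A small sketch (the scenario slugs are illustrative identifiers, not values emitted by Lamina):

```python
# Illustrative slugs for the seven benchmark scenarios above.
REQUIRED_SCENARIOS = {
    "hosted-oauth-install", "first-image-run", "first-video-run",
    "brand-aware-planning", "batch-creative", "clarification-loop",
    "auth-recovery",
}

def distribution_ready(passed_scenarios):
    """True only when every required scenario has a passing run
    for the client under test."""
    return REQUIRED_SCENARIOS <= set(passed_scenarios)
```

Run this per client: one client passing all seven does not make the distribution ready for the others.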

Pass Criteria

Use these thresholds for the preferred-runtime scorecard:
| Metric | Target |
| --- | --- |
| Install success rate | 95% or higher per supported client |
| First successful generation time | Under 5 minutes from clean client install |
| Tool call auth failure rate | Under 1% after successful install |
| Clarification-loop rate | Tracked separately from hard failures |
| Run completion rate | 90% or higher for benchmark workflows |
| Webhook or polling delivery success | 99% for terminal run visibility |
| Output usability rate | 95% of completed runs have at least one usable output |
Do not count needs_input as a runtime failure. It is an input-stage clarification outcome and should be optimized by improving discovery, examples, and prompt mapping.
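The thresholds above can be encoded directly so a scorecard run gives a single pass/fail answer. A hedged sketch (the `stats` keys are illustrative names for measured rates; only the threshold values come from the table):

```python
def scorecard(stats):
    """Compare measured rates against the pass-criteria targets.
    Rates are fractions in [0, 1]; the first-generation time is in
    seconds. needs_input outcomes are assumed to already be excluded
    from the failure rates, per the guidance above."""
    return {
        "install_success": stats["install_success_rate"] >= 0.95,
        "first_generation_time": stats["median_first_generation_s"] < 300,
        "auth_failure": stats["tool_call_auth_failure_rate"] < 0.01,
        "run_completion": stats["run_completion_rate"] >= 0.90,
        "terminal_delivery": stats["terminal_delivery_rate"] >= 0.99,
        "output_usability": stats["usable_output_rate"] >= 0.95,
    }

def passes(stats):
    """True only when every metric meets its target."""
    return all(scorecard(stats).values())
```

Keeping the thresholds in one place makes it obvious which single metric blocked a release when `passes` returns False.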

Suggested Benchmark Record

Store one record per client, scenario, and run:
```json
{
  "client": "claude-code",
  "scenario": "first-image-run",
  "startedAt": "2026-04-23T07:00:00.000Z",
  "completedAt": "2026-04-23T07:02:11.000Z",
  "installSucceeded": true,
  "authSucceeded": true,
  "toolCalls": [
    { "name": "lamina_create", "outcome": "success", "durationMs": 812 },
    { "name": "lamina_status", "outcome": "success", "durationMs": 1304 }
  ],
  "runId": "00000000-0000-0000-0000-000000000000",
  "finalStatus": "completed",
  "outputCount": 1,
  "failureStage": null,
  "notes": "Clean install, OAuth consent, one image output."
}
```
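The `startedAt`/`completedAt` pair is what feeds the five-minute first-generation target, so each record's wall-clock duration is worth computing consistently. A small sketch using the ISO 8601 timestamps from the record above:

```python
from datetime import datetime

def run_duration_seconds(record):
    """Wall-clock seconds from startedAt to completedAt.

    Timestamps are ISO 8601 with a trailing Z, as in the benchmark
    record; the Z is rewritten to +00:00 for fromisoformat
    compatibility on older Python versions.
    """
    started = datetime.fromisoformat(record["startedAt"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(record["completedAt"].replace("Z", "+00:00"))
    return (completed - started).total_seconds()
```

The example record works out to 131 seconds, comfortably under the 5-minute target.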

Dashboard Breakdown

At minimum, build dashboard cards for:
- Install starts, successful registrations, and failed registrations by MCP client
- OAuth approvals, token issues, bearer-token failures, and insufficient-scope failures
- Tool-call success rate by tool_name
- needs_input rate by tool_name and requested modality
- Runtime failure rate by workflow/app when available
- Empty-output and failed-output rate after terminal completed status
- Time from first install event to first completed output
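As an example of one card, tool-call success rate by tool_name falls out of a single pass over the agent_runtime.tool_call.completed events. A hedged sketch using the field names from the example payload (the aggregation itself is illustrative, not a built-in dashboard query):

```python
from collections import defaultdict

def success_rate_by_tool(events):
    """Success rate per tool_name across tool_call.completed events.

    Each event needs only `tool_name` and `success`; other payload
    fields are ignored here.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        totals[event["tool_name"]] += 1
        successes[event["tool_name"]] += bool(event["success"])
    return {name: successes[name] / totals[name] for name in totals}
```

The same shape generalizes to the other cards by swapping the grouping key (failure_stage, needs_input, auth_mode) for tool_name.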