Use this guide to measure whether an AI agent can install Lamina, authenticate, create assets, and retrieve outputs without custom recovery code.

What To Measure

Track every agent run against five stages:
| Stage | Success Signal | Failure Stage |
| --- | --- | --- |
| Install | Client discovers /mcp/agent and completes dynamic registration or manual install | install |
| Auth | OAuth authorization and token exchange succeed for the selected workspace | auth |
| Input | The agent supplies enough brief, brand, and asset context to start a run | input |
| Runtime | Lamina queues and completes the selected creative workflow | runtime |
| Output | The agent receives usable final assets or structured outputs | output |
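The five stages are ordered, so a run's failure stage can be derived as the first stage whose success signal was not observed. A minimal sketch (the function name and the per-stage boolean input are illustrative, not part of the Lamina API):

```python
# Ordered benchmark stages from the table above; a run's failure stage
# is the first stage whose success signal was not observed.
STAGES = ["install", "auth", "input", "runtime", "output"]

def classify_failure_stage(stage_results):
    """Return the first failed stage name, or None if all five passed.

    `stage_results` maps stage name -> bool; a missing stage counts
    as not yet succeeded.
    """
    for stage in STAGES:
        if not stage_results.get(stage, False):
            return stage
    return None
```

A fully successful run yields `None`; a run that installed but never authenticated classifies as `auth`.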
The hosted MCP runtime emits benchmark-oriented telemetry events when server telemetry is enabled with POSTHOG_SERVER_API_KEY.
| Event | When It Fires |
| --- | --- |
| agent_runtime.install.discovery_challenged | A remote MCP client discovers that /mcp/agent requires OAuth |
| agent_runtime.install.client_registered | Dynamic MCP OAuth registration succeeds |
| agent_runtime.install.failed | Dynamic registration fails |
| agent_runtime.auth.authorize_redirected | The OAuth authorize request validates and redirects to consent |
| agent_runtime.auth.approved | A signed-in user approves workspace access |
| agent_runtime.auth.token_issued | Authorization code or refresh-token exchange succeeds |
| agent_runtime.auth.succeeded | A bearer token is accepted for /mcp/agent |
| agent_runtime.auth.failed | OAuth authorization, token exchange, or bearer validation fails |
| agent_runtime.tool_call.completed | A call to one of the five MCP tools completes and is classified |
Every agent_runtime.tool_call.completed event includes:
```json
{
  "tool_name": "lamina_create",
  "success": true,
  "outcome": "success",
  "failure_stage": null,
  "failure_category": null,
  "duration_ms": 1420,
  "auth_mode": "oauth",
  "run_status": "completed",
  "needs_input": false,
  "output_count": 1
}
```
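Before aggregating these events, it can help to validate that each payload is internally consistent. A hedged sketch using the field names from the example payload above (the specific invariants checked are illustrative assumptions, not documented server guarantees):

```python
def check_tool_call_event(event):
    """Flag internal inconsistencies in an
    agent_runtime.tool_call.completed payload. Returns a list of
    problem descriptions; an empty list means the event looks sane."""
    problems = []
    if event.get("success"):
        # A successful call should not carry failure classification.
        if event.get("outcome") != "success":
            problems.append("success=true but outcome != 'success'")
        if event.get("failure_stage") is not None:
            problems.append("success=true but failure_stage is set")
    else:
        if event.get("failure_stage") is None:
            problems.append("success=false but failure_stage missing")
    # Durations should always be non-negative integers of milliseconds.
    if not isinstance(event.get("duration_ms"), int) or event["duration_ms"] < 0:
        problems.append("duration_ms must be a non-negative integer")
    return problems
```

Running this over an event stream before building dashboards catches malformed payloads early instead of skewing the rates downstream.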

Benchmark Scenarios

Run these scenarios for each supported MCP client before declaring the distribution ready.
| Scenario | Required Proof |
| --- | --- |
| Hosted OAuth install | The client discovers metadata, registers or uses its configured client, shows consent, receives tokens, and lists exactly five tools |
| First image run | The agent calls lamina_create from a short prompt and obtains at least one image output through lamina_status |
| First video run | The agent calls lamina_create for a video task and can wait or poll until a terminal result |
| Brand-aware planning | The agent calls lamina_brand, uses the returned guidance, and starts a run with the same workspace context |
| Batch creative | The agent calls lamina_batch with 3 to 10 related briefs and receives per-item runId values or actionable item errors |
| Clarification loop | An intentionally underspecified request returns needsInput with missing fields, examples, and a follow-up prompt |
| Auth recovery | An expired or insufficient-scope token yields a clear OAuth error and the client can reauthorize |
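Since every scenario must pass for every supported client, a readiness check reduces to set containment. A small sketch (the scenario slugs are illustrative identifiers, not values emitted by Lamina):

```python
# Illustrative slugs for the seven benchmark scenarios above.
REQUIRED_SCENARIOS = {
    "hosted-oauth-install", "first-image-run", "first-video-run",
    "brand-aware-planning", "batch-creative", "clarification-loop",
    "auth-recovery",
}

def distribution_ready(passed_scenarios):
    """True only when every required scenario has a passing run
    for the client under test."""
    return REQUIRED_SCENARIOS <= set(passed_scenarios)
```

Run this per client: one client passing all seven does not make the distribution ready for the others.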

Pass Criteria

Use these thresholds for the preferred-runtime scorecard:
| Metric | Target |
| --- | --- |
| Install success rate | 95% or higher per supported client |
| First successful generation time | Under 5 minutes from clean client install |
| Tool call auth failure rate | Under 1% after successful install |
| Clarification-loop rate | Tracked separately from hard failures |
| Run completion rate | 90% or higher for benchmark workflows |
| Webhook or polling delivery success | 99% for terminal run visibility |
| Output usability rate | 95% of completed runs have at least one usable output |
Do not count needs_input as a runtime failure. It is an input-stage clarification outcome and should be optimized by improving discovery, examples, and prompt mapping.
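The thresholds above can be encoded directly so a scorecard run gives a single pass/fail answer. A hedged sketch (the `stats` keys are illustrative names for measured rates; only the threshold values come from the table):

```python
def scorecard(stats):
    """Compare measured rates against the pass-criteria targets.
    Rates are fractions in [0, 1]; the first-generation time is in
    seconds. needs_input outcomes are assumed to already be excluded
    from the failure rates, per the guidance above."""
    return {
        "install_success": stats["install_success_rate"] >= 0.95,
        "first_generation_time": stats["median_first_generation_s"] < 300,
        "auth_failure": stats["tool_call_auth_failure_rate"] < 0.01,
        "run_completion": stats["run_completion_rate"] >= 0.90,
        "terminal_delivery": stats["terminal_delivery_rate"] >= 0.99,
        "output_usability": stats["usable_output_rate"] >= 0.95,
    }

def passes(stats):
    """True only when every metric meets its target."""
    return all(scorecard(stats).values())
```

Keeping the thresholds in one place makes it obvious which single metric blocked a release when `passes` returns False.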

Suggested Benchmark Record

Store one record per client, scenario, and run:
```json
{
  "client": "claude-code",
  "scenario": "first-image-run",
  "startedAt": "2026-04-23T07:00:00.000Z",
  "completedAt": "2026-04-23T07:02:11.000Z",
  "installSucceeded": true,
  "authSucceeded": true,
  "toolCalls": [
    { "name": "lamina_create", "outcome": "success", "durationMs": 812 },
    { "name": "lamina_status", "outcome": "success", "durationMs": 1304 }
  ],
  "runId": "00000000-0000-0000-0000-000000000000",
  "finalStatus": "completed",
  "outputCount": 1,
  "failureStage": null,
  "notes": "Clean install, OAuth consent, one image output."
}
```
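The `startedAt`/`completedAt` pair is what feeds the five-minute first-generation target, so each record's wall-clock duration is worth computing consistently. A small sketch using the ISO 8601 timestamps from the record above:

```python
from datetime import datetime

def run_duration_seconds(record):
    """Wall-clock seconds from startedAt to completedAt.

    Timestamps are ISO 8601 with a trailing Z, as in the benchmark
    record; the Z is rewritten to +00:00 for fromisoformat
    compatibility on older Python versions.
    """
    started = datetime.fromisoformat(record["startedAt"].replace("Z", "+00:00"))
    completed = datetime.fromisoformat(record["completedAt"].replace("Z", "+00:00"))
    return (completed - started).total_seconds()
```

The example record works out to 131 seconds, comfortably under the 5-minute target.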

Dashboard Breakdown

At minimum, build dashboard cards for:
- Install starts, successful registrations, and failed registrations by MCP client
- OAuth approvals, token issues, bearer-token failures, and insufficient-scope failures
- Tool-call success rate by tool_name
- needs_input rate by tool_name and requested modality
- Runtime failure rate by workflow/app when available
- Empty-output and failed-output rate after terminal completed status
- Time from first install event to first completed output
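As an example of one card, tool-call success rate by tool_name falls out of a single pass over the agent_runtime.tool_call.completed events. A hedged sketch using the field names from the example payload (the aggregation itself is illustrative, not a built-in dashboard query):

```python
from collections import defaultdict

def success_rate_by_tool(events):
    """Success rate per tool_name across tool_call.completed events.

    Each event needs only `tool_name` and `success`; other payload
    fields are ignored here.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for event in events:
        totals[event["tool_name"]] += 1
        successes[event["tool_name"]] += bool(event["success"])
    return {name: successes[name] / totals[name] for name in totals}
```

The same shape generalizes to the other cards by swapping the grouping key (failure_stage, needs_input, auth_mode) for tool_name.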