External agent adoption testing

See where real external agents break on your product.

Gauntlet tests whether real external agents can actually use your product from its docs, APIs, and tools.

8 agent personas · parallel execution
gauntlet run · gauntlet_042
LIVE
steel.dev / session-api · claude-sonnet-4-6
trace · impatient
1. GET /llms.txt 200 OK
2. POST /v1/sessions 201 Created
3. POST /v1/sessions 429 Rate Limit
agent abandoned: no retry logic
attribution · docs_gap · high

Rate limit behavior undocumented in llms.txt. No retry guidance. Agent abandoned without recovery.

Trusted by agent-facing teams

Steel
Early design partner
The blind spot

You built for agents.
But can they actually use it?

Every agent-facing team assumes their product is reachable. Most are wrong in ways that only appear when real external agents try to adopt it from scratch.

Looks fine in demos

Your product works perfectly when you run it. You control the docs, the flow, and the inputs. External agents encounter a completely different surface.

Fails in the wild

Real agents call your endpoints in an unexpected order, misread ambiguous docs, skip auth steps, hit rate limits with no retry guidance, and abandon on the first failure.

You find out too late

By the time a customer tells you their agent can't adopt your product, you've already lost the integration — and you still don't know exactly why it broke.

Where external-agent adoption breaks

Ambiguous docs · Missing llms.txt guidance · Auth flow assumptions · Rate limit behavior · Retry contract · State ordering · Error message quality · Tool contract drift · Recovery paths · Concurrency handling
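
To make the "Rate limit behavior" and "Retry contract" items concrete, here is a minimal sketch of the client-side behavior a documented retry contract makes possible. It assumes the API answers a rate-limited call with HTTP 429 and, optionally, a `Retry-After` header given in seconds. `call_with_retry`, `send`, and the backoff defaults are hypothetical names and values for illustration, not part of Gauntlet or any specific API.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str]]  # (HTTP status, response headers)


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Response:
    """Call send(); on a 429, wait and retry, making at most max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        status, headers = send()
        # Anything other than 429 is returned as-is, and so is the final 429.
        if status != 429 or attempt == max_attempts:
            return status, headers
        try:
            # Assumption: Retry-After is in seconds and the header key is
            # spelled exactly like this (the HTTP-date form is not handled).
            delay = float(headers.get("Retry-After", ""))
        except ValueError:
            # No usable header: fall back to exponential backoff.
            delay = base_delay * 2 ** (attempt - 1)
        sleep(delay)
        attempt += 1
```

An agent can only behave this way if the docs say that 429 is retryable and how long to wait. Without that guidance, abandoning on the first 429 is a reasonable reading of the response.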
How it works

Four steps from context to clarity

No setup overhead. Feed your docs, run the gauntlet, get a report that tells you exactly what to fix.

01

Ingest product context

Feed Gauntlet your llms.txt, llms-full.txt, OpenAPI specs, tool contracts, and any other docs. Gauntlet shapes these into a retrievable knowledge base.

llms.txt · OpenAPI · docs · tools
02

Run agent workflows

Eight built-in agent personas run realistic workflows against your product surface — each with different behavior styles, knowledge slices, and failure vectors.

8 personas · parallel execution
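
As a rough illustration of what parallel execution across personas can look like, the sketch below fans the eight persona names from this page out over `asyncio.gather`. `run_persona` is a hypothetical placeholder for a real agent workflow, and the result shape is an assumption. This is not Gauntlet's implementation.

```python
import asyncio
from typing import Dict, List

# Persona names as listed on this page.
PERSONAS = [
    "methodical", "impatient", "chaotic", "confused",
    "long-running", "adversarial", "parallel", "recovery",
]


async def run_persona(name: str) -> Dict[str, object]:
    """Hypothetical stand-in for one persona's workflow against the product."""
    await asyncio.sleep(0)  # yield control, as real network calls would
    return {"persona": name, "steps": []}


async def run_all(personas: List[str]) -> List[Dict[str, object]]:
    # gather() runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*(run_persona(p) for p in personas))


results = asyncio.run(run_all(PERSONAS))
```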
03

Capture traces and evidence

Every tool call, retry, decision, and failure is recorded as a structured trace. Nothing is inferred — everything is observed and logged.

step-by-step · full tool log
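
A structured trace of this kind could be modeled as below. The field names (`run_id`, `persona`, `steps`, `outcome`) are assumptions made for illustration and are not Gauntlet's actual schema. The recorded steps mirror the impatient-persona trace shown at the top of this page.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TraceStep:
    index: int
    method: str
    path: str
    status: int


@dataclass
class Trace:
    run_id: str
    persona: str
    steps: List[TraceStep] = field(default_factory=list)
    outcome: str = ""

    def record(self, method: str, path: str, status: int) -> None:
        """Append an observed call; the index is its 1-based position."""
        self.steps.append(TraceStep(len(self.steps) + 1, method, path, status))


trace = Trace(run_id="gauntlet_042", persona="impatient")
trace.record("GET", "/llms.txt", 200)
trace.record("POST", "/v1/sessions", 201)
trace.record("POST", "/v1/sessions", 429)
trace.outcome = "abandoned"  # hypothetical label for "agent abandoned"

# Serialize the whole trace for storage or later review.
print(json.dumps(asdict(trace), indent=2))
```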
04

Judge and report

Gauntlet attributes each failure to docs, product behavior, tooling, or model limits. You get a report with specific evidence and concrete fixes.

attributed · actionable · HTML + JSON
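
One way to picture layer-level attribution is as a small data model, sketched below. Only `docs_gap` and `high` appear on this page. The other layer labels, the `Finding` fields, and the second finding are hypothetical and exist only to show the grouping.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class Layer(Enum):
    DOCS = "docs_gap"              # label taken from the sample trace
    PRODUCT = "product_behavior"   # hypothetical label
    TOOLING = "tooling"            # hypothetical label
    MODEL = "model_limits"         # hypothetical label


@dataclass(frozen=True)
class Finding:
    layer: Layer
    severity: str
    evidence: str


def count_by_layer(findings: Iterable[Finding]) -> Counter:
    """Count how many findings were attributed to each layer."""
    return Counter(f.layer for f in findings)


findings = [
    # From the sample trace: 429 on the second POST, no retry guidance in llms.txt.
    Finding(Layer.DOCS, "high", "step 3: POST /v1/sessions returned 429"),
    # Invented finding, included only to make the grouping visible.
    Finding(Layer.TOOLING, "low", "illustrative example, not from a real run"),
]
summary = count_by_layer(findings)
```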
Output

Concrete evidence, not just pass/fail

A completed run is not necessarily a good run. Gauntlet surfaces clean success, recovered success, suspect success, and hard failure — with full evidence for each.
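
The four outcomes can be read as a decision over a few observed facts about a run. The sketch below is one plausible reading, not Gauntlet's actual criteria. It assumes "suspect" means the run completed but its end state could not be verified, and "recovered" means it completed after at least one error. The inputs `completed`, `errors_seen`, and `goal_verified` are hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    CLEAN_SUCCESS = "clean_success"
    RECOVERED_SUCCESS = "recovered_success"
    SUSPECT_SUCCESS = "suspect_success"
    HARD_FAILURE = "hard_failure"


def classify(completed: bool, errors_seen: int, goal_verified: bool) -> Outcome:
    """Map observed run facts to an outcome (assumed criteria, see lead-in)."""
    if not completed:
        return Outcome.HARD_FAILURE
    if not goal_verified:
        # The agent reported success, but the end state could not be confirmed.
        return Outcome.SUSPECT_SUCCESS
    if errors_seen > 0:
        return Outcome.RECOVERED_SUCCESS
    return Outcome.CLEAN_SUCCESS
```

Under these assumptions, the abandoned run in the sample trace would be `classify(False, 1, False)`, a hard failure.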

Execution traces

Step-by-step records of every tool call, decision, and result for each agent run. Full fidelity, no summarization.

Failure attribution

Each failure is categorized: docs gap, product behavior, tooling issue, or model capability. Know which layer broke.

Reproduction paths

Exact steps to reproduce each failure. Give your team something concrete to fix, not a vague failure report.

Cross-persona analysis

See which agent types succeed and which fail — and why the same product produces different outcomes by persona.

Evidence-backed fixes

Recommendations anchored to specific trace evidence and doc citations. No guessing. No generic advice.

HTML + JSON reports

Human-readable dashboard for your team. Machine-readable JSON for your CI pipeline. Both generated automatically.

One command. One investigation bundle.

artifacts/gauntlet_runs/gauntlet_042/ · layer2_run_*.json · gauntlet_042.html · gauntlet_042.json
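
A CI job could gate on the JSON report along these lines. The report schema is not documented on this page, so the `findings`, `attribution`, and `severity` keys below are assumptions, and the inline sample stands in for the contents of a file such as gauntlet_042.json.

```python
import json
from typing import Iterable, List


def blocking_findings(report: dict, blocking: Iterable[str] = ("high",)) -> List[dict]:
    """Return findings whose severity is in the blocking set (assumed schema)."""
    levels = set(blocking)
    return [f for f in report.get("findings", []) if f.get("severity") in levels]


# Hypothetical report content; in CI this would be loaded from the run's JSON file.
sample = json.loads(
    '{"run_id": "gauntlet_042",'
    ' "findings": [{"attribution": "docs_gap", "severity": "high"}]}'
)

# A non-zero exit code fails the build when any blocking finding is present.
exit_code = 1 if blocking_findings(sample) else 0
```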

See a sample report
Differentiation

This is not an eval.
This is not observability.

Gauntlet occupies a category that didn't exist before: testing the product surface that agents consume, from the perspective of a real external agent trying to adopt it cold.

Generic agent evals

Test model quality and task capability

Gauntlet

Test product quality and agent adoption readiness

Observability tools

Capture what happened in production

Gauntlet

Find what will break before production, with layer-level attribution

Benchmark scores

Measure model performance on curated tasks

Gauntlet

Measure whether your product surface is usable by real external agents

Manual testing

A human QA engineer covers ~20 scenarios

Gauntlet

8 distinct agent personas run in parallel against your actual product surface

Persona engine

Eight agents, eight failure modes

These are not cosmetic variations. Each persona changes how the agent behaves, what docs it trusts, and what class of product failures become visible.

methodical

Follows docs literally

Happy path, baseline correctness

impatient

Hammers endpoints fast

Rate limits, concurrency

chaotic

Reorders and perturbs flows

Edge cases, input validation

confused

Misreads params, wrong endpoints

Doc clarity, error messages

long-running

Multi-step deep sessions

Timeouts, session continuity

adversarial

Pushes boundaries and limits

Security, payload limits

parallel

Concurrent sub-tasks

State isolation, concurrency

recovery

Fails intentionally, then recovers

Idempotency, retry logic

Who it's for

Built for products that agents consume

If your product's end users are AI agents — or agent-mediated workflows — Gauntlet is built for you.

Browser & computer-use APIs

Products like browser automation APIs, headless infra, and session managers that agents call to take actions on the web.

Browserbase · Steel.dev · similar

Tool and integration platforms

MCP servers, action APIs, and integration layers that expose real-world tools to agents. If agents call your tools, they need to work reliably.

MCP providers · Composio-style · Nango

API and docs products

Any API product that wants external agents to adopt it from docs alone. Your llms.txt is your agent-facing contract — Gauntlet tests it.

API companies · docs-first products

Internal AI platform teams

Platform teams making their internal APIs consumable by agent-mediated workflows. Find the gaps before your internal agents hit them.

Enterprise AI infra · agent-mediated workflows


Ready?

Know if your product is actually agent-ready.

Get a diagnostic run against your API with a real, evidence-backed report. No sales cycle required to see the output.