Methodology

How AgentRank measures agent behavior

This page is versioned. Changes are listed in the changelog and never applied retroactively to published numbers.

What one journey is

Agent CLI + pinned model + task prompt + repository condition → the agent works unattended → we diff the workspace, install and build the result, and classify the run into exactly one outcome.

Conditions

Each category runs a per-incumbent matrix, because most real agent work happens in existing codebases:

Condition	Template state	Measures
`greenfield`	no incumbent	default selection + DIY rate
`retention:<vendor>`	vendor's official-quickstart integration pre-installed	does the agent keep, break, or migrate off the vendor
`agentsmd:<vendor>`	incumbent + AGENTS.md naming it	steering-file compliance

Incumbent integrations are derived verbatim from official vendor quickstarts, hash-pinned, and verified to build cleanly before any study — a template that doesn't build would inflate every breakage number we report.

Outcomes (pre-registered)

Every run ends in exactly one of:

installed_passing — vendor selected (SDK installed or REST-integrated) and the build passes
installed_broken — vendor selected, build fails
diy — the agent built its own solution instead of choosing a vendor
asked_user — the agent declined to choose and asked a clarifying question
followed_incumbent — kept the existing integration
other — real output we could not classify (reviewed manually, patterns extended)
error — the agent failed: timeout, crash, or no output

Errors stay in every denominator — dropping them would flatter whichever agent fails most. Vendor selection counts SDK installs and REST-level integrations (multiple distinct vendor-specific patterns without a package).

Two layers

Recommendation probes — the agent is asked what it would use, no code written. High sample counts; answers "who gets recommended."
Build journeys — the agent actually builds. Lower sample counts; answers "what really happens," including the gap between what agents say and what they do.

Statistical rules

Minimum 30 runs per reported cell; smaller cells are labeled unreportable or directional.
Every share carries its 95% Wilson interval and sample size.
Prompt corpus is a matrix: the same task held constant across framing lenses (founder, cost-sensitive, enterprise/compliance, production, open-source), with true paraphrases within each cell. Sampled with a fixed, published seed.
Detection is deterministic — dependency diffs and registered code patterns, never an LLM judging an LLM.

Identity of a result

A number is only valid with its full manifest: model ID, agent CLI version, template hash, prompt hash, permission mode, billing mode, and date. Agent behavior is a property of that tuple — the same model can differ meaningfully across CLI versions, and preferences reshuffle at model releases. Published results are dated snapshots.

Integrity

No score retraction. Corrections ship as dated errata, never silent re-runs.
Equal run counts per vendor within a study.
Client relationships disclosed. Paid engagements buy private depth — never different public numbers, placement, or timing.
No sponsored placement in any index, ever.
Honest funnel. A contestable "agent chooses" moment occurs in a minority of real agent runs; we report the funnel before any share.
Adapter-health guards. Implausible failure rates are flagged as harness faults and investigated — never published as vendor results.

Changelog

v0.2 (June 2026) — billing mode recorded per run; REST-level integration counted as selection; ask-detection from real clarifying-question tool calls.
v0.1 (June 2026) — initial methodology.