How AgentRank measures agent behavior
This page is versioned. Changes are listed in the changelog and never applied retroactively to published numbers.
What one journey is
Agent CLI + pinned model + task prompt + repository condition → the agent works unattended → we diff the workspace, install and build the result, and classify the run into exactly one outcome.
Conditions
Each category runs a per-incumbent matrix, because most real agent work happens in existing codebases:
| Condition | Template state | Measures |
|---|---|---|
greenfield | no incumbent | default selection + DIY rate |
retention:<vendor> | vendor's official-quickstart integration pre-installed | does the agent keep, break, or migrate off the vendor |
agentsmd:<vendor> | incumbent + AGENTS.md naming it | steering-file compliance |
Incumbent integrations are derived verbatim from official vendor quickstarts, hash-pinned, and verified to build cleanly before any study — a template that doesn't build would inflate every breakage number we report.
Outcomes (pre-registered)
Every run ends in exactly one of:
installed_passing— vendor selected (SDK installed or REST-integrated) and the build passesinstalled_broken— vendor selected, build failsdiy— the agent built its own solution instead of choosing a vendorasked_user— the agent declined to choose and asked a clarifying questionfollowed_incumbent— kept the existing integrationother— real output we could not classify (reviewed manually, patterns extended)error— the agent failed: timeout, crash, or no output
Errors stay in every denominator — dropping them would flatter whichever agent fails most. Vendor selection counts SDK installs and REST-level integrations (multiple distinct vendor-specific patterns without a package).
Two layers
- Recommendation probes — the agent is asked what it would use, no code written. High sample counts; answers "who gets recommended."
- Build journeys — the agent actually builds. Lower sample counts; answers "what really happens," including the gap between what agents say and what they do.
Statistical rules
- Minimum 30 runs per reported cell; smaller cells are labeled unreportable or directional.
- Every share carries its 95% Wilson interval and sample size.
- Prompt corpus is a matrix: the same task held constant across framing lenses (founder, cost-sensitive, enterprise/compliance, production, open-source), with true paraphrases within each cell. Sampled with a fixed, published seed.
- Detection is deterministic — dependency diffs and registered code patterns, never an LLM judging an LLM.
Identity of a result
A number is only valid with its full manifest: model ID, agent CLI version, template hash, prompt hash, permission mode, billing mode, and date. Agent behavior is a property of that tuple — the same model can differ meaningfully across CLI versions, and preferences reshuffle at model releases. Published results are dated snapshots.
Integrity
- No score retraction. Corrections ship as dated errata, never silent re-runs.
- Equal run counts per vendor within a study.
- Client relationships disclosed. Paid engagements buy private depth — never different public numbers, placement, or timing.
- No sponsored placement in any index, ever.
- Honest funnel. A contestable "agent chooses" moment occurs in a minority of real agent runs; we report the funnel before any share.
- Adapter-health guards. Implausible failure rates are flagged as harness faults and investigated — never published as vendor results.
Changelog
- v0.2 (June 2026) — billing mode recorded per run; REST-level integration counted as selection; ask-detection from real clarifying-question tool calls.
- v0.1 (June 2026) — initial methodology.