Methodology

How AgentRank measures agent behavior

This page is versioned. Changes are listed in the changelog and never applied retroactively to published numbers.

What one journey is

Agent CLI + pinned model + task prompt + repository condition → the agent works unattended → we diff the workspace, install and build the result, and classify the run into exactly one outcome.

Conditions

Each category runs a per-incumbent matrix, because most real agent work happens in existing codebases:

Condition | Template state | Measures
greenfield | no incumbent | default selection + DIY rate
retention:<vendor> | vendor's official-quickstart integration pre-installed | does the agent keep, break, or migrate off the vendor
agentsmd:<vendor> | incumbent + AGENTS.md naming it | steering-file compliance
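
As an illustration of how the per-incumbent matrix expands, here is a minimal sketch that builds the condition labels for one category. The function name `build_conditions` and the vendor names are hypothetical; only the three label shapes (`greenfield`, `retention:<vendor>`, `agentsmd:<vendor>`) come from the table above.

```python
from typing import List


def build_conditions(vendors: List[str]) -> List[str]:
    """Return condition labels for one category (hypothetical helper).

    One greenfield condition, plus a retention and an agentsmd condition
    for each incumbent vendor.
    """
    conditions = ["greenfield"]
    for vendor in vendors:
        conditions.append(f"retention:{vendor}")
        conditions.append(f"agentsmd:{vendor}")
    return conditions


# Placeholder vendor names, not real study participants.
print(build_conditions(["acme", "globex"]))
```

Under this sketch a category with N incumbents yields 2N + 1 conditions.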

Incumbent integrations are derived verbatim from official vendor quickstarts, hash-pinned, and verified to build cleanly before any study — a template that doesn't build would inflate every breakage number we report.
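
One way to hash-pin a template is sketched below. This is an assumption about how such a check could work, not a description of the actual AgentRank tooling: `template_hash` and `verify_pin` are hypothetical names, and the choice of SHA-256 over sorted relative paths and file contents is illustrative.

```python
import hashlib
from pathlib import Path


def template_hash(root: Path) -> str:
    """SHA-256 over every file's relative path and bytes, in sorted order."""
    digest = hashlib.sha256()
    files = sorted(
        (p.relative_to(root).as_posix(), p)
        for p in root.rglob("*")
        if p.is_file()
    )
    for rel, path in files:
        digest.update(rel.encode("utf-8"))
        digest.update(b"\0")  # separator so path and content cannot blur together
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def verify_pin(root: Path, pinned: str) -> bool:
    """True only if the template on disk still matches its pinned hash."""
    return template_hash(root) == pinned
```

A study runner could call `verify_pin` before each run and refuse to start when the template has drifted from the pinned version.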

Outcomes (pre-registered)

Every run ends in exactly one of:

Errors stay in every denominator: dropping them would flatter whichever agent fails most. Vendor selection counts both SDK installs and REST-level integrations, where a REST-level integration means multiple distinct vendor-specific patterns present without a package.
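
To make the denominator rule concrete, here is a small sketch that computes outcome rates over all runs, errors included. The outcome labels `kept`, `broke`, and `error` are placeholders chosen for the example and are not the pre-registered outcome names; `outcome_rates` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, Iterable


def outcome_rates(outcomes: Iterable[str]) -> Dict[str, float]:
    """Share of each outcome over ALL runs; errored runs are not dropped."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Ten runs with placeholder labels: 6 kept, 2 broke, 2 errored.
runs = ["kept"] * 6 + ["broke"] * 2 + ["error"] * 2
rates = outcome_rates(runs)
print(rates["kept"])  # 0.6
```

If the two errored runs were dropped, the same data would report 6 of 8, or 0.75, for `kept`, which is the flattering effect the rule is meant to prevent.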

Two layers

Statistical rules

Identity of a result

A number is only valid with its full manifest: model ID, agent CLI version, template hash, prompt hash, permission mode, billing mode, and date. Agent behavior is a property of that tuple — the same model can differ meaningfully across CLI versions, and preferences reshuffle at model releases. Published results are dated snapshots.
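
The manifest can be pictured as an immutable record whose fields together identify a result. The sketch below is illustrative only: the class name, field names, and the `key` derivation are assumptions, and the sample values are placeholders. Only the list of seven manifest components comes from the text above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Manifest:
    """The tuple a published number is tied to (field names are hypothetical)."""
    model_id: str
    agent_cli_version: str
    template_hash: str
    prompt_hash: str
    permission_mode: str
    billing_mode: str
    date: str  # ISO 8601 date of the run, e.g. "2025-01-01"

    def is_complete(self) -> bool:
        """A result is only reportable when every field is a non-empty string."""
        return all(isinstance(v, str) and v.strip() for v in asdict(self).values())

    def key(self) -> str:
        """Stable identifier: any change to any field yields a different key."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Placeholder values for illustration only.
a = Manifest("example-model-1", "1.0.0", "t-hash", "p-hash", "auto", "api", "2025-01-01")
b = Manifest("example-model-1", "1.1.0", "t-hash", "p-hash", "auto", "api", "2025-01-01")
print(a.key() == b.key())  # False: same model, different CLI version
```

Keying results on the whole tuple keeps two runs that differ only in CLI version from being merged into one number.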

Integrity

Changelog