the yardstick · cost per correct answer

Does that tool actually earn its cost?

copeca A/B-compares a coding agent with a tool — an MCP server, a model swap, a harness change — against a clean baseline, on one objective number: the dollar cost of getting a correct answer. Neutral, reproducible, verifiable.


pip install copeca · Python · MIT

copeca — report

$ copeca run --task scenario.yaml --runner claude
running 52 tasks × 2 arms (baseline / +ripgrep-mcp)…

Cost Per Correct Answer
──────────────────────────────────────────
            baseline    tool      delta
overall     $0.031      $0.019    −38.7%
95% CI [−44%, −33%]

Per-Capability Breakdown
──────────────────────────────────────────
locate      $0.018      $0.010    −42.2%
trace       $0.027      $0.017    −37.0%
fix         $0.048      $0.044    −8.3%
debug       $0.036      $0.027    −25.0%

artifact: .copeca/run-2026-06-21.json (signed)
⚠ numbers illustrative

the gap

tool demos show capability. not value.

Does an MCP server, a model swap, or a harness change actually lower the cost of correct answers — or does it just add tokens and latency?

the accuracy trap

accuracy alone ignores cost

A tool that is a little more accurate but uses 3× the tokens can lose on cost-per-correct. An accuracy leaderboard tells you what passes, not what is worth paying for.

the cost trap

cost alone ignores correctness

A cheaper model that fails more tasks is not cheaper in any meaningful sense. Cost-per-correct divides the bill by the answers that actually count — it is the honest unit.

the metric

one number, no tricks.

cost-per-correct is the total dollars spent divided by the number of correct answers. Every answer counted in the denominator is graded deterministically, with no LLM judge in the scoring path.

cost-per-correct = total $ spent ÷ correct answers

With cost and correctness in one ratio, the trade-off is visible and comparable across runs: a small accuracy gain bought with 3× the tokens can still lose.
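A worked instance of the ratio, as a minimal sketch. The two arms and their numbers are made up for illustration, and `cost_per_correct` is a hypothetical helper, not copeca's API.

```python
def cost_per_correct(total_cost_usd: float, correct_answers: int) -> float:
    """Total dollars spent divided by the number of correct answers."""
    if correct_answers == 0:
        # No correct answers: the cost of a correct answer is unbounded.
        return float("inf")
    return total_cost_usd / correct_answers


# Hypothetical arms over the same 50 tasks (illustrative numbers only).
baseline = cost_per_correct(total_cost_usd=1.00, correct_answers=40)  # 80% accurate
tool = cost_per_correct(total_cost_usd=3.00, correct_answers=44)      # 88% accurate, 3x the spend

print(f"baseline: ${baseline:.4f} per correct answer")
print(f"tool:     ${tool:.4f} per correct answer")
print(f"delta:    {(tool - baseline) / baseline:+.1%}")
```

The tool arm answers four more tasks correctly and still costs more per correct answer, which is the case an accuracy-only comparison hides.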

clean isolation, honest grading

define. run. grade. read.

Four steps from a scenario file to a signed, per-capability report.

step 01

define→

Describe the scenario: tasks × modes × models. Each task names the information required — never the method — so no tool is privileged in the prompt.

step 02

run→

copeca runs each arm in an isolated git worktree — a clean baseline vs the tool under test, tool-restriction enforced per arm. A validity gate then confirms the tool was actually used, so no result claims a win the tool never produced.
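A rough sketch of per-arm isolation and the validity gate, not copeca's implementation. It assumes one detached worktree per arm via the real `git worktree add --detach` command; the function names, the arm names `baseline` and `tool`, and the list-of-tool-call-names trace format are all hypothetical.

```python
import subprocess
from pathlib import Path


def worktree_add_command(repo: Path, arm: str, commit: str, root: Path) -> list[str]:
    """Build the `git worktree add` command for one arm, detached at a pinned commit."""
    return ["git", "-C", str(repo), "worktree", "add", "--detach", str(root / arm), commit]


def create_worktrees(repo: Path, arms: list[str], commit: str, root: Path) -> dict[str, Path]:
    """Create one worktree per arm so no arm sees another arm's edits."""
    paths = {}
    for arm in arms:
        subprocess.run(worktree_add_command(repo, arm, commit, root), check=True)
        paths[arm] = root / arm
    return paths


def passes_validity_gate(arm: str, tool_calls: list[str], tool_prefix: str) -> bool:
    """Baseline must never touch the tool; the tool arm must call it at least once.

    `tool_calls` is an assumed trace format: the names of the tool calls the agent made.
    """
    used = any(name.startswith(tool_prefix) for name in tool_calls)
    return (not used) if arm == "baseline" else used
```

A tool-arm run that never calls the tool fails the gate, so its result cannot be counted as a win for the tool.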

step 03

grade→

Deterministic grading only: a string rubric or a test-command exit code. Never an LLM judge in the scoring path. The same output gets the same grade every time.
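The two grading modes can be sketched as below. This is an illustration under assumptions: the rubric is taken to be a list of required substrings matched case-insensitively, and both function names are hypothetical.

```python
import subprocess
import sys
from typing import Optional


def grade_by_rubric(answer: str, required: list[str]) -> bool:
    """Pass only if every required string appears in the answer (case-insensitive)."""
    text = answer.lower()
    return all(item.lower() in text for item in required)


def grade_by_command(command: list[str], cwd: Optional[str] = None, timeout_s: int = 300) -> bool:
    """Pass only if the test command exits with status 0; a timeout counts as a fail."""
    try:
        result = subprocess.run(command, cwd=cwd, capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


# Stand-ins for real test commands: one exits 0, one exits 3.
passing = grade_by_command([sys.executable, "-c", "raise SystemExit(0)"])
failing = grade_by_command([sys.executable, "-c", "raise SystemExit(3)"])
print(passing, failing)
```

Neither path calls a model, so grading the same answer or the same worktree again gives the same result.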

step 04

read

A per-capability cost-per-correct delta with bootstrap confidence intervals. When the interval crosses zero, copeca says so plainly: no significant effect. See exactly where the tool helps, and where it doesn’t justify its cost. A control set of tool-neutral tasks confirms that a win is real, not narrow specialization paid for by a regression elsewhere.
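One way such an interval could be computed, as a sketch. It assumes a percentile bootstrap that resamples tasks with replacement, paired across arms; the source does not say which bootstrap variant copeca uses, and every name here is hypothetical.

```python
import random


def cost_per_correct(costs, correct):
    n_correct = sum(correct)
    return sum(costs) / n_correct if n_correct else float("inf")


def bootstrap_delta_ci(base, tool, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the relative cost-per-correct delta.

    `base` and `tool` are lists of (cost_usd, is_correct) pairs, one per task,
    in the same task order, so each resample keeps the arms paired.
    """
    rng = random.Random(seed)
    n = len(base)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        b = cost_per_correct([base[i][0] for i in idx], [base[i][1] for i in idx])
        t = cost_per_correct([tool[i][0] for i in idx], [tool[i][1] for i in idx])
        if b in (0.0, float("inf")) or t == float("inf"):
            continue  # skip degenerate resamples (no correct answers or zero cost)
        deltas.append((t - b) / b)
    if not deltas:
        raise ValueError("no usable resamples")
    deltas.sort()
    lo = deltas[int((alpha / 2) * len(deltas))]
    hi = deltas[int((1 - alpha / 2) * len(deltas)) - 1]
    return lo, hi


def verdict(lo, hi):
    """An interval that contains zero is reported as no significant effect."""
    return "no significant effect" if lo <= 0.0 <= hi else "significant effect"


# Illustrative data: 20 tasks, all correct, the tool arm uniformly cheaper.
base = [(0.03, True)] * 20
tool = [(0.02, True)] * 20
lo, hi = bootstrap_delta_ci(base, tool)
print(f"95% CI [{lo:+.1%}, {hi:+.1%}] -> {verdict(lo, hi)}")
```

Resampling whole tasks keeps each task's baseline and tool results together, so task difficulty cancels out of the delta.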

neutral · reproducible · verifiable

results you can share, and contest.

Three properties that make a copeca report trustworthy enough to publish.

pillar 01

Neutral

Tasks name the information required, never the method — so no tool is privileged in the prompt. An agnosticism lint enforces this. Config-driven multi-CLI: the same scenario runs unmodified against claude or codex.
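What an agnosticism lint might look like, in a minimal form. The term list and the `lint_prompt` name are assumptions for illustration; the source does not describe how copeca's lint works.

```python
import re

# Hypothetical list of method-revealing terms; a real lint would be configurable.
METHOD_TERMS = ("grep", "ripgrep", "rg", "mcp", "language server")


def lint_prompt(prompt: str, terms=METHOD_TERMS) -> list[str]:
    """Return the method terms a task prompt mentions as whole words."""
    return [
        term
        for term in terms
        if re.search(rf"\b{re.escape(term)}\b", prompt, flags=re.IGNORECASE)
    ]


neutral = lint_prompt("Which file defines the parser for the --glob flag?")
leaky = lint_prompt("Use ripgrep to find the parser for the --glob flag.")
print(neutral, leaky)
```

The first prompt names only the information wanted and passes; the second names a method and is flagged.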

pillar 02

Reproducible

Deterministic grading, repos pinned to exact commits, bootstrap confidence intervals, no LLM judge in the scoring path. Grade the same output twice and the grade is the same. The confidence interval tells you how tight the estimate is.

pillar 03

Verifiable

Signed .copeca artifacts (Ed25519) with integrity manifests and batch verification. The vendor’s billed cost is the headline, frozen into the artifact, with a cross-check computed from tokens × price.
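The cross-check and a manifest digest can be sketched with the standard library alone. Everything here is assumed for illustration: the per-million-token prices, the 2% tolerance, the record layout, and the function names. Ed25519 signing is left out because it needs a third-party library.

```python
import hashlib
import json


def computed_cost_usd(usage: dict, price_per_mtok: dict) -> float:
    """Recompute cost from token counts and per-million-token prices."""
    return sum(usage[kind] * price_per_mtok[kind] / 1_000_000 for kind in usage)


def cross_check(billed_usd: float, usage: dict, price_per_mtok: dict, rel_tol: float = 0.02) -> bool:
    """True when the billed cost agrees with tokens x price within a relative tolerance."""
    computed = computed_cost_usd(usage, price_per_mtok)
    return abs(billed_usd - computed) <= rel_tol * max(billed_usd, computed)


def manifest_digest(record: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so any later edit changes the digest."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Hypothetical token usage and prices (USD per million tokens).
usage = {"input": 12_000, "output": 800}
prices = {"input": 3.0, "output": 15.0}
record = {"billed_usd": 0.048, "usage": usage}
print(cross_check(record["billed_usd"], usage, prices), manifest_digest(record)[:12])
```

A digest like this is the kind of value a signature would cover: change the billed cost after the fact and the digest no longer matches.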

where does the tool help?

the capability taxonomy.

Every task is tagged with a capability, so the report shows where a tool helps — not just an overall average. A tool that lifts locate but not fix tells a very different story than one that lifts everything uniformly.
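How a per-capability breakdown falls out of tagged results, as a sketch. The record fields (`capability`, `arm`, `cost_usd`, `correct`) are an assumed layout, not copeca's artifact schema, and the rows are made up.

```python
from collections import defaultdict


def per_capability(results):
    """Cost-per-correct for each (capability, arm) pair."""
    totals = defaultdict(lambda: {"cost": 0.0, "correct": 0})
    for r in results:
        bucket = totals[(r["capability"], r["arm"])]
        bucket["cost"] += r["cost_usd"]
        bucket["correct"] += int(r["correct"])
    return {
        key: (b["cost"] / b["correct"] if b["correct"] else float("inf"))
        for key, b in totals.items()
    }


# Illustrative rows only.
results = [
    {"capability": "locate", "arm": "baseline", "cost_usd": 0.02, "correct": True},
    {"capability": "locate", "arm": "baseline", "cost_usd": 0.02, "correct": False},
    {"capability": "locate", "arm": "tool", "cost_usd": 0.01, "correct": True},
    {"capability": "fix", "arm": "tool", "cost_usd": 0.05, "correct": False},
]
table = per_capability(results)
for (capability, arm), value in sorted(table.items()):
    print(f"{capability:8s} {arm:9s} {value:.3f}")
```

Failed tasks still add to the cost of their capability, so a capability where the tool burns tokens without producing correct answers stands out instead of being averaged away.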

locate

Finding where something lives in the codebase — file, function, or symbol.

trace

Following a call chain, data flow, or control path across files.

fix

Applying a targeted, correct change to the code without side-effects.

debug

Identifying the root cause of a test failure or unexpected output.

the corpus

52 tasks · four codebases

ripgrep (Rust) · gin (Go) · express (JavaScript) · fastapi (Python) — provenance-tracked and contamination-screened. Each codebase contributes tasks across all four capability tags.

the honest answer

is copeca for you?

copeca runs a controlled experiment and hands back a single comparable number. The experiment takes real time and real API spend — so:

reach for copeca when — a fit

you are choosing between MCP servers, models, or harness changes
you want one objective, comparable number — not a vibe-check
you publish results others can reproduce and verify
you need to know which capability benefits, not just whether the average moves

look elsewhere when — not this

you want a quick demo that looks good in a slide deck
you only care about accuracy, regardless of what it costs
a single leaderboard number is enough for your decision

start

install and run your first comparison.

install

pip install copeca

also: pipx install copeca for an isolated environment

a minimal run

# run a scenario against the claude runner
copeca run --task scenario.yaml --runner claude

# or against the codex runner
copeca run --task scenario.yaml --runner codex

# verify a previous signed artifact
copeca verify .copeca/run-2026-06-21.json

where copeca is

runners

2 claude · codex

corpus

52 tasks

grading

deterministic

Numbers above are illustrative of the current state and move as copeca does. Follow along on GitHub →