Does that tool actually earn its cost?
copeca A/B-compares a coding agent with a tool — an MCP server, a model swap, a harness change — against a clean baseline, on one objective number: the dollar cost of getting a correct answer. Neutral, reproducible, verifiable.
pip install copeca · Python · MIT
tool demos show capability. not value.
Does an MCP server, a model swap, or a harness change actually lower the cost of correct answers — or does it just add tokens and latency?
accuracy alone ignores cost
A tool that is a little more accurate but uses 3× the tokens can lose on cost-per-correct. An accuracy leaderboard tells you what passes, not what is worth paying for.
cost alone ignores correctness
A cheaper model that fails more tasks is not cheaper in any meaningful sense. Cost-per-correct divides the bill by the answers that actually count — it is the honest unit.
one number, no tricks.
cost-per-correct is the total dollars spent divided by the number of correct answers. Every answer counted in the denominator is graded deterministically, with no LLM judge in the scoring path.
Because cost and correctness sit in one ratio, the trade-off between them is visible in a single number and comparable across runs.
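As a worked example of the arithmetic, here is a minimal sketch. The function name and all numbers are hypothetical, and it assumes that 3× the tokens means 3× the spend (same price per token in both arms).

```python
def cost_per_correct(total_cost_usd: float, correct: int) -> float:
    """Total dollars spent divided by the number of correct answers."""
    if correct == 0:
        # Nothing correct: there is no finite price per correct answer.
        return float("inf")
    return total_cost_usd / correct


# Hypothetical arms, 50 tasks each.
baseline = cost_per_correct(total_cost_usd=10.00, correct=35)   # 70% accurate
with_tool = cost_per_correct(total_cost_usd=30.00, correct=40)  # 80% accurate, 3x the spend

print(f"baseline:  ${baseline:.3f} per correct answer")
print(f"with tool: ${with_tool:.3f} per correct answer")
print("tool wins" if with_tool < baseline else "tool loses on cost-per-correct")
```

In this made-up case the tool arm is more accurate yet pays more for each correct answer, which is exactly the situation an accuracy-only leaderboard hides.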
define. run. grade. read.
Four steps from a scenario file to a signed, per-capability report.
Describe the scenario: tasks × modes × models. Each task names the information required — never the method — so no tool is privileged in the prompt.
copeca runs each arm in an isolated git worktree — a clean baseline vs the tool under test, tool-restriction enforced per arm. A validity gate then confirms the tool was actually used, so no result claims a win the tool never produced.
Deterministic grading only: a string rubric or a test-command exit code. Never an LLM judge in the scoring path. The grade is the same every time you run it.
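The two grading styles named above can be sketched as follows. This is an illustration only, not copeca's implementation: the function names are hypothetical, and the rubric semantics (every required string must appear, case-insensitively) are an assumption.

```python
import subprocess
import sys
from typing import List, Optional


def grade_by_rubric(answer: str, required: List[str]) -> bool:
    """Pass only if every required string appears in the answer (assumed semantics)."""
    text = answer.lower()
    return all(s.lower() in text for s in required)


def grade_by_command(cmd: List[str], cwd: Optional[str] = None,
                     timeout_s: float = 600.0) -> bool:
    """Pass only if the test command exits with status 0."""
    try:
        result = subprocess.run(cmd, cwd=cwd, capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, OSError):
        # A hung or unlaunchable test command counts as a failure.
        return False
    return result.returncode == 0


# String rubric: the answer must mention the expected file.
print(grade_by_rubric("The handler lives in src/router.py", ["src/router.py"]))

# Exit-code grading, using the current interpreter as a stand-in test command.
print(grade_by_command([sys.executable, "-c", "raise SystemExit(0)"]))
print(grade_by_command([sys.executable, "-c", "raise SystemExit(1)"]))
```

Neither function consults a model, so the same answer and the same repository state always produce the same grade.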
A per-capability cost-per-correct delta with bootstrap confidence intervals. When the interval crosses zero, copeca says so plainly: no significant effect. See exactly where the tool helps, and where it doesn’t justify its cost. A control set of tool-neutral tasks checks that a win is real, rather than specialization paid for by a regression elsewhere.
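To make "the interval crosses zero" concrete, here is a minimal percentile-bootstrap sketch. It assumes each arm is a list of (cost, correct) pairs and resamples the arms independently. copeca's actual resampling scheme is not described here, so treat the function names, defaults, and data as hypothetical.

```python
import math
import random


def cost_per_correct(runs):
    """runs: list of (cost_usd, is_correct) pairs for one arm."""
    total = sum(cost for cost, _ in runs)
    correct = sum(1 for _, ok in runs if ok)
    return total / correct if correct else float("inf")


def bootstrap_delta_ci(baseline, tool, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the (tool - baseline) cost-per-correct delta."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    deltas = []
    for _ in range(n_resamples):
        b = rng.choices(baseline, k=len(baseline))  # resample with replacement
        t = rng.choices(tool, k=len(tool))
        d = cost_per_correct(t) - cost_per_correct(b)
        if math.isfinite(d):  # skip resamples where an arm had no correct answers
            deltas.append(d)
    if not deltas:
        raise ValueError("no resample had a correct answer in both arms")
    deltas.sort()
    lo = deltas[int((alpha / 2) * len(deltas))]
    hi = deltas[int((1 - alpha / 2) * len(deltas)) - 1]
    return lo, hi


# Hypothetical data: 20 runs per arm.
baseline = [(0.20, True)] * 14 + [(0.20, False)] * 6
tool = [(0.25, True)] * 15 + [(0.25, False)] * 5

lo, hi = bootstrap_delta_ci(baseline, tool)
verdict = "no significant effect" if lo <= 0.0 <= hi else "significant effect"
print(f"delta 95% CI: [{lo:+.3f}, {hi:+.3f}] -> {verdict}")
```

A negative delta means the tool arm pays less per correct answer. An interval that contains zero is reported as no significant effect rather than rounded into a win.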
results you can share, and contest.
Three properties that make a copeca report trustworthy enough to publish.
Neutral
Tasks name the information required, never the method — so no tool is privileged in the prompt. An agnosticism lint enforces this. Config-driven multi-CLI: the same scenario runs unmodified against claude or codex.
Reproducible
Deterministic grading, repos pinned to exact commits, bootstrap confidence intervals, no LLM judge in the scoring path. Run the same scenario twice and the grade is the same. The confidence interval tells you how tight the estimate is.
Verifiable
Signed .copeca artifacts (Ed25519) with integrity manifests and batch verification. The vendor’s billed cost is the headline, frozen into the artifact, with a cost computed from tokens × price as a cross-check.
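The token × price cross-check can be illustrated with a short sketch. The function names, the per-million-token prices, and the 5% tolerance are all assumptions for illustration. Real bills may also include cached-token or other line items that this ignores.

```python
def computed_cost_usd(input_tokens, output_tokens,
                      price_in_per_mtok, price_out_per_mtok):
    """Recompute cost from token counts and per-million-token prices."""
    return (input_tokens * price_in_per_mtok
            + output_tokens * price_out_per_mtok) / 1_000_000


def cross_check(billed_usd, computed_usd, rel_tol=0.05):
    """True when the billed and recomputed costs agree within rel_tol."""
    if billed_usd == 0:
        return computed_usd == 0
    return abs(billed_usd - computed_usd) / billed_usd <= rel_tol


# Hypothetical run: 120k input tokens at $3/Mtok, 8k output tokens at $15/Mtok.
computed = computed_cost_usd(120_000, 8_000, 3.0, 15.0)
billed = 0.49  # hypothetical figure reported by the vendor

print(f"billed ${billed:.2f}, computed ${computed:.2f}, "
      f"agree: {cross_check(billed, computed)}")
```

The billed figure stays the headline. The recomputed one only flags cases where the two disagree by more than the chosen tolerance.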
the capability taxonomy.
Every task is tagged with a capability, so the report shows where a tool helps — not just an overall average. A tool that lifts locate but not fix tells a very different story than one that lifts everything uniformly.
Finding where something lives in the codebase — file, function, or symbol.
Following a call chain, data flow, or control path across files.
Applying a targeted, correct change to the code without side-effects.
Identifying the root cause of a test failure or unexpected output.
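A per-capability breakdown of this kind can be sketched by grouping runs on their capability tag. The tags `locate` and `fix` come from the text above. The function name, record layout, and numbers are hypothetical.

```python
from collections import defaultdict


def per_capability(runs):
    """runs: iterable of (capability, cost_usd, is_correct).

    Returns {capability: cost-per-correct}.
    """
    cost = defaultdict(float)
    correct = defaultdict(int)
    for cap, usd, ok in runs:
        cost[cap] += usd
        correct[cap] += int(ok)
    return {cap: (cost[cap] / correct[cap] if correct[cap] else float("inf"))
            for cap in cost}


# Hypothetical runs: four tasks per capability per arm.
baseline_runs = ([("locate", 0.30, ok) for ok in (True, True, False, False)]
                 + [("fix", 0.50, ok) for ok in (True, True, True, False)])
tool_runs = ([("locate", 0.40, ok) for ok in (True, True, True, True)]
             + [("fix", 0.70, ok) for ok in (True, True, True, False)])

base = per_capability(baseline_runs)
tool = per_capability(tool_runs)
for cap in base:
    print(f"{cap}: delta {tool[cap] - base[cap]:+.3f} USD per correct answer")
```

With these made-up numbers the tool lowers cost-per-correct on `locate` and raises it on `fix`, a split that a single overall average would blur.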
52 tasks · four codebases
ripgrep (Rust) · gin (Go) · express (JavaScript) · fastapi (Python) — provenance-tracked and contamination-screened. Each codebase contributes tasks across all four capability tags.
is copeca for you?
copeca runs a controlled experiment and hands back a single comparable number. The experiment takes real time and real API spend — so:
reach for copeca when — a fit
- you are choosing between MCP servers, models, or harness changes
- you want one objective, comparable number — not a vibe-check
- you publish results others can reproduce and verify
- you need to know which capability benefits, not just whether the average moves
look elsewhere when — not this
- you want a quick demo that looks good in a slide deck
- you only care about accuracy, regardless of what it costs
- a single leaderboard number is enough for your decision
install and run your first comparison.
install
pip install copeca
also: pipx install copeca for an isolated environment
a minimal run
where copeca is
Numbers above are illustrative of the current state and move as copeca does. Follow along on GitHub →