AI Agent Capability Audit · 4–6 Weeks · $25–100K
Which of your AI agents are actually doing the work?
A capability audit that evaluates your deployed or candidate agents against your tasks, computes Shapley-based attribution across multi-agent pipelines, and tells you exactly which agents to keep, cut, or replace — with compliance-ready artifacts where regulators care.
When the capability audit pays for itself
Five vendors, no principled way to compare.
Each vendor's pitch deck claims best-in-class. Each vendor's benchmark is the one they win. You need an apples-to-apples score against your tasks, not theirs.
Pipeline outcomes are good, contributions are murky.
Your multi-agent workflow is producing acceptable results — and you have no idea which agent is doing the work. When one of them gets expensive or unreliable, you can't tell what removing it costs.
Risk and compliance want explainability you don't have.
Model risk, MiFID II, SR 11-7, FDA reviewers — they need per-component attribution and verifiable capability claims. Vendor self-attestation isn't enough.
What you get
The method, briefly
Each agent has a latent capability profile — a vector across the dimensions that matter for your tasks. We measure it through observed outcomes, not vendor claims. The measurements form a comparable capability space across vendors, including open-weight alternatives.
For multi-agent pipelines, we apply Shapley value decomposition: a mathematically principled way to attribute marginal contribution per agent. It is the same technique from cooperative game theory (for which Lloyd Shapley shared the 2012 Nobel Memorial Prize) used to allocate value fairly across coalitions; we apply it to your AI pipeline so you can see which agents earn their cost.
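As an illustration of the idea (not our production tooling), here is a minimal sketch of exact Shapley attribution over a three-agent pipeline. The agent names and coalition scores below are hypothetical; in a real engagement the value function comes from measured outcomes on your tasks.

```python
from itertools import combinations
from math import factorial

def shapley_values(agents, value):
    """Exact Shapley values: each agent's marginal contribution,
    averaged over all coalitions it could join, with the classic
    |S|! (n-|S|-1)! / n! weighting."""
    n = len(agents)
    phi = {a: 0.0 for a in agents}
    for a in agents:
        others = [x for x in agents if x != a]
        for k in range(n):
            for subset in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                coalition = frozenset(subset)
                phi[a] += weight * (value(coalition | {a}) - value(coalition))
    return phi

# Hypothetical pipeline: task success rate per coalition of agents.
scores = {
    frozenset(): 0.0,
    frozenset({"retriever"}): 0.30,
    frozenset({"drafter"}): 0.20,
    frozenset({"checker"}): 0.05,
    frozenset({"retriever", "drafter"}): 0.70,
    frozenset({"retriever", "checker"}): 0.40,
    frozenset({"drafter", "checker"}): 0.35,
    frozenset({"retriever", "drafter", "checker"}): 0.90,
}

phi = shapley_values(["retriever", "drafter", "checker"], lambda s: scores[s])
```

In this toy example the attributions sum to the full pipeline's score (0.90), and the checker's small Shapley value is exactly the kind of signal that feeds a keep/cut/replace verdict. Exact computation is exponential in the number of agents; larger pipelines use sampled approximations.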
Privacy stays intact throughout. Where cross-vendor comparison would otherwise require exposing prompts or outputs, we use quantized Johnson-Lindenstrauss projections so that vendors compare on shape, not content. Read more in our research →
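To give a flavor of "compare on shape, not content", here is a minimal sketch of a quantized Johnson-Lindenstrauss projection: project each party's vector with a shared random matrix, keep only the sign bits, and estimate angular similarity from bit agreement. Dimensions, seeds, and function names are illustrative, not our actual protocol.

```python
import numpy as np

def qjl_sketch(vec, dim=64, seed=0):
    """Quantized JL sketch: random Gaussian projection to a low
    dimension, then 1-bit sign quantization. The original vector's
    content cannot be reconstructed from the bits alone."""
    rng = np.random.default_rng(seed)  # shared seed = shared projection
    proj = rng.standard_normal((dim, len(vec)))
    return np.sign(proj @ vec)

def sketch_similarity(a, b):
    """Fraction of agreeing sign bits; approximates 1 - angle/pi
    between the original vectors (the SimHash estimate)."""
    return float(np.mean(a == b))

# Two parties sketch locally and exchange only the sign bits.
rng = np.random.default_rng(42)
v = rng.standard_normal(256)
w = v + 0.1 * rng.standard_normal(256)   # a near-duplicate profile
sim = sketch_similarity(qjl_sketch(v, seed=7), qjl_sketch(w, seed=7))
```

Both sides must agree on the projection seed; what crosses the boundary is a short bit string, so vendors can be ranked against each other without exposing prompts, outputs, or embedding weights.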
How the engagement runs
Scoping
Identify the agents in scope, the task dimensions that matter to your business, and the outcome data we'll use as ground truth. Define success criteria before measurement begins.
Data collection & instrumentation
Lightweight instrumentation of the agent pipeline (we don't ask you to re-architect anything). Capture inputs, outputs, costs, latencies, and outcome signals. Build the capability vectors.
Attribution analysis
Compute Shapley decomposition across the multi-agent pipeline. Cross-validate against held-out outcome data. Identify capability gaps, redundancies, and outright failures.
Report & recommendations
Findings deck with per-agent verdict (keep/cut/replace), quantified impact estimates, and compliance-ready artifacts. Read-out sessions for your sponsor, risk team, and procurement.
Scope & pricing
Quick
3–5 agents, single task domain, no pipeline attribution
- Per-agent capability vectors
- Side-by-side comparison against your tasks
- Failure-mode taxonomy
- Keep/cut/replace recommendations
Standard
5–15 agents, 2–3 task domains, basic pipeline attribution
- Everything in Quick
- Shapley attribution across single pipeline
- Multi-domain capability comparison
- Privacy-preserving cross-vendor analysis
- Two read-out sessions
Enterprise
15+ agents, multiple domains, full pipeline attribution, compliance artifacts
- Everything in Standard
- Full Shapley decomposition across all pipelines
- Compliance-ready artifacts (SR 11-7, MiFID II, FDA)
- Implementation support for keep/cut/replace decisions
- 30 days of post-engagement Slack/email support
Frequently asked questions
What is an AI agent capability audit?
A structured evaluation of your AI agents — whether candidates you're considering or systems already in production — that produces per-agent capability vectors across the task dimensions that matter to you. For multi-agent pipelines, we compute Shapley-based attribution so you can see each agent's marginal contribution to outcomes. The deliverable is a report telling you which agents to keep, cut, or replace, and why.
Who should commission a capability audit?
Enterprises evaluating 5–15+ AI agents from different vendors with no principled way to compare them. Financial services firms with model risk requirements (MiFID II, SEC, OCC). Healthcare/pharma operations with explainability needs (FDA). Any organization running multi-agent workflows where it's unclear which agent is creating value and which is freeloading on the others.
How is Shapley-based attribution different from agent-level benchmarks?
Public benchmarks score agents in isolation against synthetic tasks. Shapley attribution measures each agent's marginal contribution to your actual production pipeline against your real outcomes. Two agents that look identical on benchmarks can have wildly different Shapley values in your specific workflow — and that's the number that matters for keep/cut/replace decisions.
Do you require access to our agents' internals?
No. We work from observable inputs, outputs, and outcomes. Where we need to compare embeddings or capability profiles across agents from different vendors, we use privacy-preserving projections (quantized Johnson-Lindenstrauss) so that vendor IP and your sensitive data both remain protected.
What's the difference between this and the Value Thread Audit?
The Value Thread Audit is broader and upstream — it maps your entire data → BI → AI spend to outcomes, agents or no agents. The Capability Audit is narrower and downstream — it dissects multi-agent systems specifically. Many engagements run the value audit first and the capability audit second when warranted.
Can the audit support regulatory or compliance reporting?
Yes. The capability vectors and Shapley attribution diagnostics are designed to map cleanly to model risk frameworks (SR 11-7 for US banking, MiFID II, FDA model explainability requirements). We deliver the audit artifacts in a form your compliance and risk teams can incorporate into their existing reporting.