
Why Your AI Spend Audit Is Lying to You — The Attribution Problem

Most enterprise AI spend audits add up costs and call it a day. The number you actually need — which dollar produced which dollar of value — requires solving a harder problem. Here is how Shapley-based attribution does it.

AI Spend · Attribution · Shapley · Enterprise AI

You've seen the slide. The CFO put it up last quarter. AI tooling spend, broken out by vendor, summed, growth-rate annotated, plotted on a clean bar chart. There is a tidy number at the bottom: total annual AI spend.

That number is true. It is also useless for the question you actually need to answer: which of those dollars produced which dollars of business value?

Most "AI spend audits" stop at the first number. The second number is harder. It requires solving the attribution problem.

The aggregate is not the answer

A finance-side audit sums up:

  • Foundation-model API spend (OpenAI, Anthropic, Bedrock)
  • Vector database costs (Pinecone, Weaviate, pgvector at scale)
  • Agent platform / orchestration spend (LangChain hosted, CrewAI Pro, your own)
  • BI and analytics seats that are now AI-augmented (Tableau Pulse, Mode, ThoughtSpot)
  • Data infrastructure that exists primarily to feed models (feature stores, embedding pipelines)
  • Internal headcount allocated to AI initiatives

Add them up and you get a defensible number. Allocate it across business units and you get a defensible chart. Show it to the audit committee and you've discharged your reporting duty.

You still don't know whether AI is paying for itself.

The attribution problem, briefly

Imagine a single workflow: a customer-service triage pipeline. It uses (a) an embedding model to vectorize incoming tickets, (b) a retrieval step against a knowledge base, (c) an LLM call to generate a candidate response, (d) a second LLM call to validate the response against policy, and (e) a human-in-the-loop reviewer for edge cases. The pipeline produces measurable outcomes: deflection rate, time-to-resolution, CSAT delta, escalation rate.

Now: which of those five components produced the outcome?

If you remove (c) and use a cheaper model, deflection drops by 4 points. If you remove (d), policy violations show up in 0.8% of responses but deflection holds. If you remove (a) and use lexical retrieval, retrieval recall drops 30% but the LLM recovers most of it. The components interact. Their values are not independent.

This is a cooperative game theory problem, and the canonical answer is the Shapley value, which Lloyd Shapley introduced in 1953 (he later shared the 2012 Nobel Memorial Prize in Economics). The Shapley value of a player in a cooperative game is the player's average marginal contribution across all possible coalitions. Applied to a multi-component AI pipeline, it tells you how much value each component is producing, weighted across every counterfactual configuration of the others.
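
In symbols, with N the set of n pipeline components and v(S) the outcome metric when only the components in coalition S are enabled, the standard Shapley value of component i is:

```latex
\phi_i(v) = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(n-|S|-1)!}{n!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr)
```

The weight is the probability that, in a uniformly random ordering of components, exactly the members of S precede i; summing over all coalitions makes phi_i the average marginal contribution across all n! orderings.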

The math is straightforward. The bookkeeping is not — for n components, exact Shapley calculation requires evaluating 2^n coalitions, and sampling approximations are the practical answer. But the result is honest: a per-component attribution that respects interaction effects.
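
To make the bookkeeping concrete, here is a minimal sketch in Python. The `value` function is an assumption: something you supply that maps a set of enabled components to the measured outcome (e.g. deflection rate from an ablation run). Everything else is the textbook computation, exact for small pipelines and permutation-sampled beyond that.

```python
import itertools
import math
import random

def exact_shapley(components, value):
    """Exact Shapley values: O(2^n) coalition evaluations, fine for small n.
    `value(coalition)` maps a frozenset of component names to an outcome metric."""
    n = len(components)
    phi = {c: 0.0 for c in components}
    for c in components:
        others = [x for x in components if x != c]
        for r in range(len(others) + 1):
            # Weight = probability that exactly these r components precede c
            # in a uniformly random ordering.
            weight = math.factorial(r) * math.factorial(n - r - 1) / math.factorial(n)
            for subset in itertools.combinations(others, r):
                s = frozenset(subset)
                phi[c] += weight * (value(s | {c}) - value(s))
    return phi

def sampled_shapley(components, value, num_permutations=2000, seed=0):
    """Permutation-sampling approximation for larger pipelines: average each
    component's marginal contribution as it joins a randomly ordered coalition."""
    rng = random.Random(seed)
    phi = {c: 0.0 for c in components}
    for _ in range(num_permutations):
        order = list(components)
        rng.shuffle(order)
        coalition, prev = frozenset(), value(frozenset())
        for c in order:
            coalition = coalition | {c}
            cur = value(coalition)
            phi[c] += cur - prev
            prev = cur
    return {c: total / num_permutations for c, total in phi.items()}
```

For the five-component triage pipeline above, exact computation means only 2^5 = 32 coalition evaluations; sampling starts to matter once a pipeline grows past a dozen or so components.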

What this means in practice

When you compute Shapley attribution across the pipeline above, three things tend to fall out:

  1. One component carries disproportionate weight. Often it's a retrieval-quality component (the one that nobody got promoted for shipping). Removing it tanks the pipeline. Keep it; resource it.

  2. One component is freeloading. It looked important in the architecture diagram. The Shapley value says it's adding marginal value of approximately zero. Cut it or replace it; the savings often pay for the audit several times over.

  3. One component is exactly as valuable as it looks. Confirming this is itself worth money — you can now defend the line item in the budget meeting with attribution data, not vibes.

These three patterns reproduce across enterprise AI workflows. The specifics differ. The shape of the answer doesn't.

Why most audits skip this

Three reasons:

  • Instrumentation. Computing Shapley values requires you to log inputs, outputs, costs, latencies, and outcome signals at every component boundary in the pipeline. Most production AI systems don't. Adding the instrumentation is straightforward but non-zero work — and outside the scope most consulting firms quote. (A minimal logging sketch follows this list.)

  • Counterfactual evaluation. You need to be able to re-run the pipeline (or a representative sample of it) with components ablated. If your only "AI infrastructure" is a stack of vendor APIs you can't introspect, ablation requires careful test-set construction.

  • Math. Shapley value computation isn't hard, but it's outside the skill set of most management consultants, and the agents-only ML crowd often doesn't think about pipelines as cooperative games.
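
On the instrumentation point, a boundary logger can be as small as a decorator. This is a hedged sketch, not a standard schema: the (output, metadata) return convention and the field names are assumptions you'd adapt to your own stack.

```python
import functools
import json
import time
import uuid

def boundary_log(component, path="pipeline_events.jsonl"):
    """Append one JSONL event per call at a component boundary.
    Assumed convention: the wrapped function returns (output, metadata),
    where metadata carries whatever cost/outcome signals the component
    knows about (token counts, dollar cost, retrieval hit counts, ...)."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            event = {"event_id": str(uuid.uuid4()), "component": component}
            start = time.perf_counter()
            output, metadata = fn(*args, **kwargs)
            event["latency_s"] = round(time.perf_counter() - start, 4)
            event.update(metadata)
            with open(path, "a") as f:
                f.write(json.dumps(event) + "\n")
            return output
        return wrapper
    return decorator

@boundary_log("retrieve")
def retrieve(query):
    docs = ["kb-doc-12", "kb-doc-47"]  # stand-in for the real vector-store call
    return docs, {"cost_usd": 0.0004, "num_docs": len(docs)}
```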

The good news: none of these is a serious blocker. The instrumentation lift is days, not months. The counterfactual evaluation can be done on held-out outcome data without re-running production. The math is well-understood and there are honest sampling approximations.

What an honest audit produces

A real AI spend audit — one that actually answers "which dollar produced which dollar" — outputs three artifacts:

| Artifact | Format | Decisions it supports |
| --- | --- | --- |
| Per-component capability vector | Numerical, ~5–10 dimensions | Vendor selection, model swaps, infrastructure changes |
| Shapley attribution table | Per-pipeline, per-component | Keep / cut / replace, budget reallocation |
| Counterfactual sensitivity analysis | What-if scenarios on outcomes | Investment prioritization, risk assessment |

You don't need to publish these to the board. You need them on your desk, in front of the same CFO who showed the aggregate spend slide, the next time someone asks why a particular line item is so large.

The privacy footnote

If your pipeline spans multiple vendors and you want to compare them apples-to-apples, you face a side problem: comparing capability profiles often requires comparing embeddings, prompts, or outputs that contain sensitive content. You don't want to send vendor A's outputs to vendor B.

There's a clean solution. Quantized Johnson-Lindenstrauss projections approximately preserve pairwise distances in a much lower-dimensional space, and the quantization step makes recovering the original vectors hard. Each vendor publishes a projection of their capability profile against your task set; comparison happens in projected space; no raw outputs leave anybody's environment. This is the same technique that lets cross-vendor attribution scale to real enterprise comparisons without violating anyone's data agreements.
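
To picture the mechanics, here is a sketch of one instantiation: sign-quantized random projections in the SimHash family, assuming the vendors agree on a shared projection seed and a common task set. The strength of the privacy protection depends on parameter choices; this only shows the distance-preservation idea.

```python
import numpy as np

def quantized_jl_sketch(profile, dim_out=256, seed=1234):
    """Project a capability profile through a shared random Gaussian matrix
    and keep only the sign bits. Each party publishes bits, never raw vectors."""
    rng = np.random.default_rng(seed)  # shared seed => shared projection
    proj = rng.standard_normal((dim_out, len(profile)))
    return (proj @ np.asarray(profile) > 0).astype(np.uint8)

def sketch_agreement(bits_a, bits_b):
    """Fraction of matching bits. For sign projections, expected agreement is
    1 - theta/pi, where theta is the angle between the original profiles,
    so relative similarity survives the quantization."""
    return float((bits_a == bits_b).mean())

# Two vendors' capability profiles over the same task set (toy numbers):
vendor_a = [0.91, 0.40, 0.77, 0.63, 0.88]
vendor_b = [0.89, 0.45, 0.70, 0.60, 0.85]
print(sketch_agreement(quantized_jl_sketch(vendor_a), quantized_jl_sketch(vendor_b)))
```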

We'll cover the privacy-enhancing tech stack in more depth in a separate piece.

What to do Monday

If you're spending more than $500K a year on AI tooling and can't answer the "which dollar produced which dollar" question:

  1. Instrument one pipeline. Pick the highest-spend production AI workflow. Add boundary logging. Two engineers, two weeks.
  2. Run a small ablation study. On held-out outcome data, evaluate the pipeline with each component independently swapped or removed. Compute approximate Shapley values (a sketch follows this list).
  3. Decide one thing. Use the results to make exactly one budget decision — kill, swap, or scale a component. Track outcome delta over the next quarter.
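
A sketch of what step 2 looks like in code, with a synthetic stand-in where your replay harness would go. The numbers are made up for illustration, and `sampled_shapley` is the function from the sketch earlier in this piece.

```python
from itertools import combinations

COMPONENTS = ["embed", "retrieve", "generate", "validate", "review"]

def replay_metric(enabled):
    """Stand-in for 'replay held-out tickets with only `enabled` components
    active, return the outcome metric'. Synthetic numbers; note the
    embed/retrieve interaction, which is what Shapley attribution untangles."""
    score = 0.40
    if "generate" in enabled:
        score += 0.20
    if "retrieve" in enabled:
        score += 0.10 if "embed" in enabled else 0.06
    if "validate" in enabled:
        score += 0.01
    if "review" in enabled:
        score += 0.03
    return score

# Precompute every coalition once (2^5 = 32 replays for this pipeline)...
value_table = {
    frozenset(s): replay_metric(frozenset(s))
    for r in range(len(COMPONENTS) + 1)
    for s in combinations(COMPONENTS, r)
}

# ...then attribute: phi = sampled_shapley(COMPONENTS, value_table.__getitem__)
```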

Repeat for the next pipeline. Compound.

If you'd rather have someone else do steps 1–2 with you, that's what our Value Thread Audit and Capability Audit engagements are for. The Value Thread Audit covers the whole data → BI → AI stack; the Capability Audit goes deep on multi-agent pipelines specifically.

Either way: stop showing the aggregate slide. It's true and it's useless.