Six months of work. The AI model performs, the pilot delivers, and the team is ready to scale. Then the quarterly budget review comes around.
The CFO pulls up the cloud invoice on the screen. Three hundred line items. No one in the room can explain which charges belong to the AI workload and which do not.
The pilot gets paused. Again.
This plays out across industries, more often than most AI teams care to admit. The models work. The technology is there.
What keeps AI stuck in pilot mode is something far less glamorous. No one can tell the board what it costs to produce a unit of AI output.
When the cloud bill arrives
Starting an AI project in the cloud makes sense. You pay for what you use, spin up a few GPUs, run your first experiments. That flexibility is real, especially when you are still figuring out your use case and data.
The trouble starts when workloads grow. What looked like a simple GPU rental turns into a monthly invoice stacked with extras. Data transfer fees, storage tiers, orchestration overhead, networking charges, and software licensing are all layered on top of the GPU cost you thought you were paying.
Guy D’Hauwer, a technology scaling advisor with over 20 years of experience across European telecom and cloud markets, compares it to old mobile phone bills. Eight pages of line items, zero clarity on what you actually got for your money.
“You thought you were renting a few GPUs. Then you open the invoice and find charges you never planned for.” – Guy D’Hauwer.
According to Sander ten Hoedt, who leads Cisco’s data center, cloud, and AI infrastructure portfolio across Benelux, cloud costs can grow by a factor of thirty once AI workloads move past the pilot stage. At that point, any business case falls apart.
Tokens as a management metric
Walk into most AI budget meetings, and you hear GPU hours, compute time, and cloud subscriptions. Those are input numbers. They tell you what you spent.
They tell you nothing about what came out the other end.
A more useful number is the cost per token. Tokens are the measurable output of an AI system, the smallest unit of work your model delivers. Know what each token costs to generate, and the budget conversation with finance changes.
Instead of “we need more GPU capacity,” you walk into the boardroom and say, “this application generates X tokens per euro, and each token represents Y minutes of analyst work saved.”
That is when AI stops being an open-ended line item and starts looking like a production metric the board can steer on.
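The shift from input metrics to output metrics is simple arithmetic. A rough sketch, using entirely made-up numbers for spend, token volume, and analyst time saved (the real figures come from your own invoices and workload logs):

```python
# Hypothetical pilot: 12,000 EUR/month of AI infrastructure spend,
# producing 400 million tokens of model output per month.
monthly_infra_cost_eur = 12_000.0
monthly_tokens = 400_000_000

# Output metrics the board can steer on, instead of raw GPU hours.
tokens_per_euro = monthly_tokens / monthly_infra_cost_eur
cost_per_million_tokens = monthly_infra_cost_eur / (monthly_tokens / 1_000_000)

# Value side of the equation. Assumption: one million tokens of output
# replaces roughly 50 minutes of analyst work at 90 EUR/hour.
analyst_minutes_per_million_tokens = 50
analyst_hourly_rate_eur = 90.0
value_per_million_tokens = (analyst_minutes_per_million_tokens / 60) * analyst_hourly_rate_eur

print(f"{tokens_per_euro:,.0f} tokens per euro")
print(f"{cost_per_million_tokens:.2f} EUR per million tokens")
print(f"{value_per_million_tokens:.2f} EUR of analyst time saved per million tokens")
```

With these illustrative numbers, a million tokens costs 30 EUR to produce and saves 75 EUR of analyst time. That ratio, not the GPU hour count, is the sentence you take into the budget review.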
GPUs waiting on data
There is a cost leak in most AI setups that has nothing to do with GPU pricing. It sits in the layers around the GPU.
In practice, many AI infrastructures run at barely thirty percent of their capacity. Picture a row of GPUs, powered on, drawing energy, producing nothing. Storage cannot deliver data fast enough.
Network bandwidth was never sized for AI traffic. The compute layer is ready. The rest of the infrastructure is not.
This happens in both cloud and on-premises setups. Teams invest heavily in GPUs and undersize everything around them. When GPUs wait, you pay for hardware that sits idle.
In the cloud, the reflex is to add more GPUs. That multiplies the bill without fixing anything upstream.
As Guy D’Hauwer puts it, the bottleneck is almost never the GPU itself. It is everything that surrounds it.
Without observability across the full stack, from network throughput and storage I/O to tokens generated per second, these bottlenecks stay invisible. And invisible bottlenecks become invisible cost leaks.
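The size of that leak is easy to underestimate until you put numbers on it. A minimal sketch with assumed figures (GPU count, hourly rate, and utilization are placeholders, not benchmarks):

```python
# Hypothetical setup: 8 cloud GPUs at 2.50 EUR/hour each, billed 24/7,
# but kept busy only 30% of the time because storage and networking
# cannot feed them fast enough.
gpus = 8
hourly_rate_eur = 2.50
hours_per_month = 730  # average hours in a month
utilization = 0.30

monthly_spend = gpus * hourly_rate_eur * hours_per_month
productive_spend = monthly_spend * utilization
idle_spend = monthly_spend - productive_spend

# The rate you actually pay per hour of useful GPU work.
effective_rate_eur = hourly_rate_eur / utilization

print(f"Monthly GPU bill: {monthly_spend:,.0f} EUR")
print(f"Spent while GPUs wait on data: {idle_spend:,.0f} EUR")
print(f"Effective cost per productive GPU-hour: {effective_rate_eur:.2f} EUR")
```

At 30% utilization, the effective price of a productive GPU-hour is more than three times the sticker price. Adding GPUs lowers none of these numbers; raising utilization lowers all of them.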
From GPU purchase to production line
The term AI Factory exists for a reason. In a factory, raw materials flow in, finished products flow out, and every stage of the line is measured, tuned, and predictable.
An AI infrastructure follows that same logic. Compute, networking, storage, and software need to work as one system. A reference architecture validated by Cisco and NVIDIA reduces integration risk and provides a known performance baseline for operation.
According to Sander ten Hoedt, the complexity of building an AI Factory is still widely underestimated. It is not a traditional data center with GPUs bolted on. Every layer needs to match.
- Networking must move data to GPUs without creating wait times
- Storage needs to feed models at the speed the compute layer demands
- Observability must track cost and performance end to end
When something in this production line breaks, you want one phone number, not five. Fewer vendors mean a shorter path to the root cause. And lifecycle management, including hardware refresh cycles and flexible financing through partners like Cisco Capital, keeps the line running without budget surprises.
So, when does cloud, on-premises, or hybrid make sense? Only after you know your numbers.
What does your infrastructure produce per euro? What does each token cost? Why is seventy percent of your GPU capacity sitting unused?
Organizations that answer those questions are the ones moving from pilot to production. The rest keep funding experiments they cannot defend in a board meeting, and miss the business gains that come with scaling AI confidently.
Here is a starting point. Pull your last three months of AI infrastructure invoices. Separate the GPU cost from everything layered on top of it. Storage, networking, data transfer, licensing.
Then divide the total by the number of tokens your models produced. That single number, your cost per token, tells you more about your AI readiness than any pilot report ever will.
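The exercise above fits in a few lines. A sketch with invented invoice figures; the category names and amounts are placeholders for whatever your own cloud bill contains:

```python
# Hypothetical three-month invoice totals per category, in EUR.
invoice = {
    "gpu_compute": 36_000.0,
    "storage": 6_500.0,
    "networking": 3_200.0,
    "data_transfer": 4_800.0,
    "licensing": 5_500.0,
}
tokens_produced = 1_200_000_000  # model output over the same three months (assumed)

# Separate the GPU cost from everything layered on top of it.
gpu_cost = invoice["gpu_compute"]
overhead = sum(cost for item, cost in invoice.items() if item != "gpu_compute")
total = gpu_cost + overhead

# The single number that matters: cost per (million) tokens.
cost_per_million_tokens = total / (tokens_produced / 1_000_000)
overhead_share = overhead / total

print(f"GPU cost: {gpu_cost:,.0f} EUR, overhead: {overhead:,.0f} EUR")
print(f"Overhead share of total: {overhead_share:.0%}")
print(f"Cost per million tokens: {cost_per_million_tokens:.2f} EUR")
```

In this made-up example, more than a third of the bill is not GPU rental at all. Whatever your own split turns out to be, that one cost-per-token figure is the baseline every scaling decision, cloud, on-premises, or hybrid, should be measured against.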
Want to bring predictability into your AI bill for your next budget review?
Get in touch with our experts.
