Why AI Agent Costs Spiral

Figure: Line chart of AI agent spend rising steeply, then flattening when it hits a cost cap, with a dotted red projection showing the runaway path the cap prevented. The red line is a runaway loop. The cap stops it dead and the bill goes flat, instead of following the dotted path to an empty account.

There is a story that gets told in slightly different forms across every team running AI agents this year. Someone wires up an agent, it works, they leave it running. A task that should have taken five model calls gets stuck in a loop and takes five hundred. Nobody is watching the billing dashboard at two in the morning. By the time anyone looks, the month's budget is gone, spent by a single agent doing the same thing over and over because nothing told it to stop.

Engineers have reported personal token bills of several hundred to a couple thousand dollars a month, and whole teams burning through a full year of planned AI budget in weeks. The model is not the problem. The problem is that an autonomous system with a credit card and no hard limit will, eventually, spend without bound.

Why costs spiral in the first place

Agent cost has a property that makes it dangerous: it compounds. A normal API call is one request and one charge. An agent call is a loop. It thinks, acts, observes, and thinks again, and every turn is billable. When the task goes well, that is a handful of turns. When the agent gets confused, retries, or loops on a goal it cannot reach, the turns keep coming and each one costs money.
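
To make the compounding concrete, here is a rough sketch of the arithmetic. It assumes each turn resends the conversation so far, that the context grows by a fixed number of tokens per turn, and a price of $3.00 per million input tokens. The function name and all three numbers are illustrative assumptions, not measurements.

```python
def loop_cost_usd(turns, base_tokens=2_000, growth_tokens=500,
                  usd_per_million_tokens=3.00):
    """Estimated input-token cost of an agent loop.

    Assumes turn k (0-indexed) sends base_tokens + k * growth_tokens,
    because each turn resends the accumulated context. All defaults are
    hypothetical numbers chosen for illustration.
    """
    total_tokens = sum(base_tokens + k * growth_tokens for k in range(turns))
    return total_tokens * usd_per_million_tokens / 1_000_000


healthy = loop_cost_usd(5)     # the task that should take five calls
runaway = loop_cost_usd(500)   # the same task stuck in a loop

print(f"5 turns:   ${healthy:.4f}")
print(f"500 turns: ${runaway:.2f}")
print(f"ratio:     {runaway / healthy:.0f}x")
```

Under these assumptions, one hundred times the turns costs about 4,225 times as much, because every extra turn also pays again for the context that built up before it.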

Three patterns drive almost all of the damage. The retry loop, where a step keeps failing and the agent keeps trying, each attempt a fresh set of expensive calls. The fan out, where one instruction spawns many parallel calls that multiply faster than anyone expected. And the silent long task, where the agent quietly grinds for hundreds of turns on something nobody is watching. None of these look alarming on turn one. By turn two hundred they are a budget emergency.
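
A small sketch of how the first two patterns multiply call counts. The function names and the example numbers are hypothetical; the silent long task needs no formula, since it is simply a very large number of turns.

```python
def retry_calls(calls_per_attempt, attempts):
    """Retry loop: every failed attempt repeats the whole set of calls."""
    return calls_per_attempt * attempts


def fan_out_calls(branching, depth):
    """Fan out: each call spawns `branching` children, `depth` levels deep.

    Counts the root call plus every descendant.
    """
    return sum(branching ** level for level in range(depth + 1))


# A four-call step retried fifty times.
print(retry_calls(calls_per_attempt=4, attempts=50))   # 200

# One instruction that fans out five ways, three levels deep.
print(fan_out_calls(branching=5, depth=3))             # 1 + 5 + 25 + 125 = 156
```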

Why monitoring is not enough

The common answer is a billing dashboard and alerts. Watch the spend, get a notification when it crosses a threshold. The trouble is that monitoring is a rearview mirror. It tells you what already happened. By the time an alert fires, the money is already spent, because the alert and the spend are the same event seen a moment apart.

Alerts also depend on someone being there to act on them. An agent running overnight, on a weekend, or in a background job does not pause itself because a Slack message went out. Monitoring is useful for understanding cost after the fact. It is not a control. A control stops the thing before it happens.
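
The gap between an alert and a control can be sketched as a toy simulation. It assumes a fixed cost per call and models the human response time as a number of extra calls the agent makes before anyone acts. The function name and every number here are illustrative assumptions.

```python
def spend_when_alert_acted_on(cost_per_call, threshold, calls_before_response):
    """Total spend at the moment a person reacts to a threshold alert.

    The alert can only fire once spend has already reached `threshold`.
    The agent then keeps running for `calls_before_response` more calls,
    because a notification by itself stops nothing.
    """
    spent = 0.0
    while spent < threshold:      # monitoring only observes the spend
        spent += cost_per_call
    spent += cost_per_call * calls_before_response
    return spent


# Someone is at their desk and reacts instantly: the threshold is already spent.
print(spend_when_alert_acted_on(0.25, threshold=50.0, calls_before_response=0))

# The alert fires overnight and the agent makes 2,000 more calls before morning.
print(spend_when_alert_acted_on(0.25, threshold=50.0, calls_before_response=2_000))
```

Even in the best case the alert reports money that is already gone. In the overnight case the final spend is set by how long nobody was looking, not by the threshold.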

The fix is a cap before the call

The reliable answer is to put a hard limit in front of every spend, not behind it. Before a model call runs, estimate what it will cost. If that cost is over the ceiling, reject the call instead of making it. The agent gets a clear refusal, the same way it would handle any other failed step, and the money is never spent.
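
A minimal sketch of that check, assuming token counts are known before the call and using placeholder prices. `BudgetExceeded`, `estimate_cost_usd`, `guarded_call`, and the per-token rates are all hypothetical names and values, not any provider's API.

```python
class BudgetExceeded(Exception):
    """Raised instead of making a call whose estimated cost is over the cap."""


def estimate_cost_usd(prompt_tokens, max_output_tokens,
                      usd_per_million_input=3.00, usd_per_million_output=15.00):
    """Worst-case cost: all input tokens plus the full output allowance.

    The prices are assumed placeholders; substitute your provider's rates.
    """
    return (prompt_tokens * usd_per_million_input
            + max_output_tokens * usd_per_million_output) / 1_000_000


def guarded_call(call_model, prompt_tokens, max_output_tokens, per_call_cap_usd):
    """Estimate first; invoke `call_model` only if the estimate fits the cap."""
    estimate = estimate_cost_usd(prompt_tokens, max_output_tokens)
    if estimate > per_call_cap_usd:
        raise BudgetExceeded(
            f"estimated ${estimate:.4f} exceeds cap ${per_call_cap_usd:.4f}")
    return call_model()


calls_made = []

def fake_model_call():
    """Stand-in for a real, billable model request."""
    calls_made.append("call")
    return "ok"

print(guarded_call(fake_model_call, 1_000, 500, per_call_cap_usd=0.05))
try:
    guarded_call(fake_model_call, 200_000, 4_000, per_call_cap_usd=0.05)
except BudgetExceeded as refusal:
    print("refused:", refusal)
print("billable calls made:", len(calls_made))
```

The rejected request never reaches `fake_model_call`, so in a real system nothing would have been billed for it.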

This is the difference between a budget you watch and a budget you enforce. A watched budget can be blown through in the time between two dashboard refreshes. An enforced budget cannot be exceeded at all, because the call that would exceed it never executes. The cap is not advice to the agent. It is a wall in front of the spend.

Where the cap has to live

Here is the part teams get wrong. They try to make the agent responsible for its own budget. Add a line to the prompt: do not spend more than fifty dollars. This does not work, for the same reason you would not let someone audit their own expense report. The agent has no reliable sense of cumulative cost, and a confident model under pressure will route around a soft instruction or simply lose track of it after a hundred turns.

The cap has to live below the agent, in the layer that actually makes the call. That layer can see the real estimated cost of each request, compare it to the ceiling, and refuse. Because the agent never gets to make the call, it cannot talk its way past the limit, forget it, or loop around it. The enforcement and the spending happen in the same place, which is the only place enforcement is reliable.
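
One way to picture this layering is a thin client that owns both the budget and the only path to the model. This is a sketch under assumptions: `CappedClient`, `run_agent`, the injected `send_request` and `estimate_cost_usd` callables, and the flat $0.25 estimate are all hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a call would take spend past the remaining budget."""


class CappedClient:
    """Sits between the agent and the model API.

    The agent is handed this object and nothing else, so every request
    passes through `complete` and its budget check.
    """

    def __init__(self, send_request, estimate_cost_usd, budget_usd):
        self._send_request = send_request    # the real, billable call
        self._estimate = estimate_cost_usd   # cost estimator for one prompt
        self._remaining = budget_usd

    def complete(self, prompt):
        estimate = self._estimate(prompt)
        if estimate > self._remaining:
            raise BudgetExceeded(
                f"needs ${estimate:.2f}, only ${self._remaining:.2f} left")
        self._remaining -= estimate          # reserve before spending
        return self._send_request(prompt)


def run_agent(client, prompt, max_turns=1_000):
    """A deliberately broken agent that repeats itself until something stops it."""
    turns = 0
    try:
        while turns < max_turns:
            client.complete(prompt)
            turns += 1
    except BudgetExceeded:
        pass  # the refusal is handled like any other failed step
    return turns


sent = []
client = CappedClient(
    send_request=lambda prompt: sent.append(prompt) or "response",
    estimate_cost_usd=lambda prompt: 0.25,   # assumed flat cost per call
    budget_usd=5.00,
)
turns_completed = run_agent(client, "same prompt, over and over")
print(turns_completed, len(sent))
```

The loop is still broken, but it stops after the budget is used up rather than after a thousand turns, and nothing in the prompt had to ask it to.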

What a good cap looks like

A useful cost control has a few properties. It is a hard cap, not a warning, so the call over the line is rejected rather than logged. It runs on every call and checks against the budget that remains, so neither a single oversized request nor a runaway loop of small ones can blow the budget. It estimates before executing, so the decision happens before any money moves. And it fails closed, meaning if the cost cannot be confirmed as safe, the call does not run.
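
Failing closed is the property most easily lost in implementation, so here is a small sketch of it. The function name `is_call_allowed` and the shape of the estimator are assumptions. The point is only that every uncertain outcome maps to a refusal.

```python
import math


def is_call_allowed(estimate_fn, request, cap_usd):
    """Return True only when the cost is positively confirmed to be under the cap."""
    try:
        estimate = estimate_fn(request)
    except Exception:
        return False    # estimator crashed: cost is unknown, so refuse
    if isinstance(estimate, bool) or not isinstance(estimate, (int, float)):
        return False    # missing or malformed estimate: refuse
    if math.isnan(estimate) or estimate < 0:
        return False    # nonsensical estimate: refuse
    return estimate <= cap_usd


def broken_estimator(request):
    raise RuntimeError("pricing service unavailable")


print(is_call_allowed(lambda r: 0.01, "req", cap_usd=0.05))          # True
print(is_call_allowed(lambda r: 1.00, "req", cap_usd=0.05))          # False
print(is_call_allowed(lambda r: None, "req", cap_usd=0.05))          # False
print(is_call_allowed(broken_estimator, "req", cap_usd=0.05))        # False
```

A fail-open version would return True in the last two cases, which is exactly when a runaway agent is hardest to notice.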

With those in place, the worst case changes completely. A broken loop that would have drained your credits instead hits the ceiling and stops. You wake up to a rejected call and a flat bill, not an empty account. The agent stays useful for everything under the limit and harmless for everything over it.

Cost is a safety property

It is tempting to treat runaway spend as a billing annoyance rather than a real risk. That is a mistake. An autonomous system that can spend without bound is a system that can hurt you, the same way one that can reach your internal network or ship broken code can hurt you. We wrote about those two in "What MCP security actually takes", and the budget cap belongs in the same family of controls. Capability without a limit is the danger. The limit is what makes the capability safe to leave running.

An agent should be something you can trust unattended. That trust does not come from hoping it behaves. It comes from a wall it cannot cross, in front of the spend, every time.
