Your AI agent just ran up $800 in API costs overnight because a loop didn't terminate cleanly. You found out when you opened your billing dashboard in the morning.
If this hasn't happened to you yet, it will. Unbounded AI agents are a billing vulnerability. And the standard advice — "monitor your usage carefully" — is not a solution. It's a cope.
Here's how to set a hard token budget cap that actually stops execution before it blows through your limit, using TokenFence, part of the ZentriTools developer toolkit.
The Problem: Agents Don't Know When to Stop
LLM APIs are stateless. They don't know your monthly budget. They don't know this is the fourth time your agent has called them in the past 30 seconds because of a retry loop. They process the request and charge you.
This means budget enforcement has to happen at the application layer — in your code, before the API call, not after it.
Most developers handle this with one of three approaches, all of which are inadequate:
Soft monitoring — Checking usage dashboards periodically. Catches problems after the damage is done.
Provider-level limits — Most cloud providers let you set soft spending alerts or hard limits, but these operate at the account level, not the agent or project level. A runaway agent can burn your entire month's budget before the alert triggers.
Manual token counting — Adding token estimation logic to each agent individually. Works, but it's boilerplate that needs to be maintained across every project, every model update, every new agent you ship.
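For context, the manual approach usually means hand-rolling something like the sketch below in every project. This is illustrative code, not part of TokenFence; the ~4-characters-per-token heuristic and all names are assumptions:

```typescript
// Rough per-session token accounting, hand-rolled — the boilerplate
// a budget middleware replaces. All names here are illustrative.
const CHARS_PER_TOKEN = 4; // crude heuristic for English text

class ManualBudget {
  private used = 0;
  constructor(private limit: number) {}

  // Estimate tokens for a message payload before calling the API.
  estimate(messages: { content: string }[]): number {
    const chars = messages.reduce((n, m) => n + m.content.length, 0);
    return Math.ceil(chars / CHARS_PER_TOKEN);
  }

  // Returns false when the projected spend would cross the limit.
  tryReserve(tokens: number): boolean {
    if (this.used + tokens > this.limit) return false;
    this.used += tokens;
    return true;
  }

  get remaining(): number {
    return this.limit - this.used;
  }
}
```

Multiply this by every agent, every model's pricing quirks, and every retry path, and the maintenance cost becomes clear.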
TokenFence takes a different approach: it wraps your LLM calls with a budget-aware middleware that tracks token consumption per agent session, per project, or per arbitrary scope you define — and hard-stops execution when the limit is hit.
Two Lines of Code
Here's what hard budget enforcement looks like with TokenFence:
import { TokenFence } from '@zentritools/tokenfence';
const fence = new TokenFence({ budget: 50000, model: 'gpt-4o' });
// Your existing LLM call — unchanged
const response = await fence.call(() => openai.chat.completions.create({
  model: 'gpt-4o',
  messages: conversation,
}));
That's it. The fence.call() wrapper:
- Estimates token cost before the call based on your conversation context
- Checks remaining budget
- Executes the call if budget is available
- Tracks actual token consumption from the response
- Throws a BudgetExceededError if the call would exceed your limit
The error is typed and catchable, so you can handle it gracefully — return a "session limit reached" message to the user, log the event, alert your team, or trigger a budget review workflow.
Setting Budget Scopes
The most powerful feature of TokenFence is scoped budgeting. You're not limited to a single global cap — you can set budgets at whatever granularity your architecture needs.
Per-user session:
const fence = new TokenFence({
  budget: 10000,
  model: 'gpt-4o',
  scope: `user:${userId}`
});
Per agent:
const fence = new TokenFence({
  budget: 100000,
  model: 'claude-3-5-sonnet',
  scope: 'agent:research-pipeline'
});
Per project with daily reset:
const fence = new TokenFence({
  budget: 500000,
  model: 'gpt-4o-mini',
  scope: 'project:catalog-processor',
  resetInterval: '24h'
});
Scopes use Redis under the hood for distributed state, so they work correctly across multiple instances and serverless environments. Single-process environments fall back to in-memory tracking automatically.
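Conceptually, a scope is just a key into a shared counter store. The in-memory fallback can be pictured as something like the following simplified sketch of the idea, not TokenFence's actual implementation:

```typescript
// In-memory scoped token ledger: each scope string maps to its own
// running total. A Redis-backed version would swap the Map for
// INCRBY/GET against a shared key so multiple instances agree.
class ScopedLedger {
  private totals = new Map<string, number>();

  // Record consumption for a scope and return its new running total.
  add(scope: string, tokens: number): number {
    const next = (this.totals.get(scope) ?? 0) + tokens;
    this.totals.set(scope, next);
    return next;
  }

  used(scope: string): number {
    return this.totals.get(scope) ?? 0;
  }
}
```

Because `user:42` and `agent:research-pipeline` are just distinct keys, one user's session exhausting its budget never touches another scope's count.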
Handling Budget Exceeded Gracefully
A hard stop is only useful if your application handles it cleanly. TokenFence exports a typed error class for this:
import { TokenFence, BudgetExceededError } from '@zentritools/tokenfence';
try {
  const response = await fence.call(() => /* your llm call */);
} catch (err) {
  if (err instanceof BudgetExceededError) {
    // Remaining budget, amount attempted, scope info all available
    console.log(`Budget cap hit. Used: ${err.used} / ${err.budget} tokens`);
    return { error: 'session_budget_exceeded', remaining: 0 };
  }
  throw err; // re-throw non-budget errors
}
For agent loops, you can also use the non-throwing API to check budget before entering a new iteration:
while (agentShouldContinue) {
  const canProceed = await fence.check(estimatedTokens);
  if (!canProceed) {
    await notifyBudgetLimitReached();
    break;
  }
  // proceed with agent step
}
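The `estimatedTokens` value passed to the check doesn't need to be exact. A conservative character-based heuristic is common; the ~4 chars/token figure below is a rough English-text assumption, not TokenFence's estimator:

```typescript
// Conservative pre-flight token estimate for a chat payload.
// Overestimating slightly is safer than underestimating: it makes
// the budget check trip early rather than late.
function estimateTokens(
  messages: { role: string; content: string }[],
  maxOutputTokens: number,
): number {
  const promptChars = messages.reduce(
    (n, m) => n + m.role.length + m.content.length,
    0,
  );
  // ~4 characters per token, rounded up, plus the worst-case output.
  return Math.ceil(promptChars / 4) + maxOutputTokens;
}
```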
Why Not Just Use Provider Limits?
This question comes up. Provider-level spend limits are a safety net, not a control plane. Here's the practical difference:
- Provider limits operate on dollar spend with a billing lag — they often don't trigger until the billing cycle catches up
- They're account-wide, not scoped to individual agents or projects
- They don't give you programmatic control over what happens when the limit hits
- They don't integrate with your application's error handling or user experience
TokenFence gives you deterministic, application-level control. You decide what happens when a budget is hit: graceful user message, alert, fallback to a cheaper model, or full stop. That's the difference between infrastructure control and billing hope.
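As one example of that control, falling back to a cheaper model is a plain catch-and-retry. This sketch stubs the budget-capped caller and error type behind generic stand-ins so it is self-contained, rather than using the real TokenFence types:

```typescript
// Generic shape of a budget-capped caller; stands in here for a
// TokenFence instance so the fallback logic reads self-contained.
interface BudgetedCaller {
  call<T>(fn: () => Promise<T>): Promise<T>;
}

class BudgetError extends Error {} // stand-in for BudgetExceededError

// Try the primary (expensive) model's fence first; on a budget error
// only, retry the same request against the cheaper fallback fence.
async function callWithFallback<T>(
  primary: BudgetedCaller,
  fallback: BudgetedCaller,
  request: () => Promise<T>,
): Promise<T> {
  try {
    return await primary.call(request);
  } catch (err) {
    if (err instanceof BudgetError) {
      return fallback.call(request);
    }
    throw err; // non-budget failures propagate unchanged
  }
}
```

In practice the two callers would be two TokenFence instances configured with different models and budgets; the control flow stays the same.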
Get Started
TokenFence is available on npm and is part of the ZentriTools developer toolkit — a collection of utilities for building cost-controlled, secure AI agents. Also in the toolkit: MCP Scout (MCP server discovery) and AgentGuard (AI agent security).
npm install @zentritools/tokenfence
Full documentation and examples at zentritools.com/tokenfence. The free tier covers up to 3 scopes — no credit card required to start.
ZentriTools is part of the Digitech Online Solutions portfolio.