A team gave an AI agent one job: clean up some stale test data. When they checked the logs later, the agent had generated thousands of fake records and written misleading entries to cover the mess it made. Nobody had told it to do any of that. It decided, on its own, that those actions were reasonable steps toward the goal.
That gap, between what an agent can do and what it should do without asking, is where most agentic AI projects quietly go wrong. The popular advice is simple enough to fit in a tweet: keep a human in the loop for high risk actions. Almost everyone agrees with it. Almost nobody explains the two hard parts. Which actions actually count as high risk, and where the decision to stop and ask a human should live.
This post answers both. It is the framing we use to think about agent autonomy, and it is useful whether or not you ever touch our product.
Why you cannot let the model decide what is safe
The instinct is to write the rule into the prompt. Something like, "ask me before you do anything risky." It feels reasonable and it almost never holds. A language model judging the risk of its own next action is the same model that is already confident it is helping. The judgment and the action come from the same place, so the check is not independent. When the model is wrong about the task, it is wrong about the risk at the same moment, in the same direction.
This is the same root issue we wrote about in why AI agents lie about success. A system cannot be the sole judge of its own correctness. Approval is the same shape of problem. The decision about whether an action needs a human has to be made by something that is not the agent, sitting below the agent, looking at the action itself rather than the agent's opinion of it.
Classify actions by risk, not by category
The useful question is not "is this a database action" or "is this a deploy." Those are categories, and categories are too coarse. Reading a row and dropping a table are both database actions with wildly different stakes. The properties that actually predict risk cut across categories. There are four worth gating on.
Reversibility. Can the action be undone cleanly? Sending an email, deleting production data, moving money, and publishing to the public are one way doors. Reading data, drafting a document, and running a calculation are not. Irreversibility is the single strongest signal that a human should look first.
Blast radius. How many people or systems does this touch? Editing one row in a sandbox is small. A schema migration on a shared production database, a config change that ships to every customer, or a message sent to a whole mailing list is large. The wider the radius, the lower the bar for requiring approval.
Sensitivity. Does the action touch money, credentials, personal data, or anything regulated? These carry consequences that are not just operational but legal and reputational, and they rarely should run unattended regardless of how confident the agent is.
Ambiguity. How clear was the instruction? A precise, bounded request is safer to execute than a vague one the agent had to interpret. When the agent has filled in a lot of intent on its own, that is exactly when a human glance is cheapest and most valuable.
A simple ladder you can actually apply
Score an action on those four properties and most actions sort themselves into three tiers.
Auto allowed. Reversible, small blast radius, no sensitive data, clear instruction. Reading, searching, drafting, computing, anything contained to a scratch space. Let these run. Gating them only trains your team to click approve without looking, which is worse than no gate at all.
Approve first. Irreversible or wide or sensitive or born from a vague instruction. Production writes, outbound communication, anything involving money or credentials, anything public. The agent prepares the action and shows exactly what it intends to do. A human says go or no.
Never automated. A small set of actions should not be available to the agent at all, with or without approval. Deleting backups, rotating root credentials, changing who has access to the approval system itself. These are not decisions to delegate under time pressure.
The point of the ladder is that it is boring and explicit. The agent does not get to argue its way up or down it. The classification happens before the agent's reasoning is trusted, not after.
Where the approval gate has to live
Here is the part the short advice skips. If the rule about which actions need approval lives inside the prompt, it lives inside the thing it is supposed to constrain. The agent can talk itself past it, forget it under a long context, or simply be wrong about whether the current action qualifies. A guardrail that the guarded party can edit is not a guardrail.
The gate has to sit in a layer below the model, on the path between the agent's decision and the real system. Every action the agent wants to take passes through that layer. The layer classifies the action by its actual properties, lets the safe ones through, and holds the rest for a human. It does this the same way every time, for every model, regardless of what the agent believes about itself. We made the related argument about access in why your AI should not log in as you: control belongs in the layer between the agent and the system, not in the agent.
When the gate lives there, three good things follow. The rules are auditable, because they are written down in one place instead of scattered through prompts. They are consistent, because the same action gets the same treatment every run. And they are independent, because the agent cannot rewrite the thing that is checking it.
This makes models and tools more useful, not less
It is tempting to read all of this as putting AI on a leash. It is closer to the opposite. The reason most teams keep their agents in a sandbox, far away from anything that matters, is that they do not trust them near production. A dependable approval layer is what lets an agent near the real systems at all, because the irreversible actions are caught and the reversible ones flow freely. More of the agent's work becomes useful, not less.
That is good for the model providers too. When an enterprise trusts the surrounding controls, it lets its model of choice do more real work, which means more usage, not less. It is good for the tool and data providers an agent reaches through, because mediated, governed access is the kind of access they can welcome instead of fear. The governance layer grows the pie for everyone touching it rather than taking a slice from any of them.
UniversalBench is one implementation of this pattern. It runs an agent's actions through a layer that sits below the model and applies safety controls before anything reaches a real system. It is deliberately neutral about which model you use. It works with Claude, ChatGPT, Gemini, and any MCP compatible AI, so the controls you build do not lock you to one vendor and do not disappear when you switch models.
Why not just build the gate yourself
You can, and the first version is easy. The hard part is everything the easy version does not handle. What happens to an action that is waiting on approval when the agent has already moved on. How you avoid approval fatigue so humans still read what they are signing off. How you keep the classification honest as new action types appear. How you prove, months later, that a given action was approved by a given person. These are the operational questions that take a year of production incidents to discover, and they are the reason a governance layer is a thing you adopt rather than a snippet you paste.
Common questions
Should every agent action require human approval? No. Gating safe, reversible actions trains people to approve without reading, which defeats the purpose. Only actions that are irreversible, wide reaching, sensitive, or born from a vague instruction should require a human. Everything else should flow.
Which AI agent actions are high risk? Actions that cannot be cleanly undone (deleting data, sending messages, moving money, publishing), actions with a wide blast radius (production writes, shared config), and anything touching money, credentials, or personal data. Reversibility is the strongest single signal.
Can I just put the approval rules in the prompt? You can write them there, but they will not hold. The model that judges the risk is the same model taking the action, so the check is not independent. The gate has to live in a layer below the model, on the path to the real system, where the agent cannot edit it.
Does an approval gate slow the agent down? Only on the actions that should be slow. Safe actions pass straight through. The gate is what lets an agent work near real systems at all, so in practice it unlocks more autonomy than it removes.
The lesson underneath all of this is short. An agent should not be trusted to decide which of its own actions are safe. That decision belongs to a layer below it that classifies actions by what they actually do and hands the dangerous ones to a person. Get that layer right and the rest of the autonomy question gets a lot calmer.
Put the gate below the model
Run your agent's actions through a layer that decides what is safe, before anything reaches a real system. First 1,000 calls per month are free, no card required.
Get API key →