Keeping AI agents under control doesn't seem very hard
Techniques we use with human agents should work just fine.
I intended for this to be the final installment of my Agent Week series in June, but it took a little longer than I expected. I hope you enjoy it!
Some people are very concerned about the growing capabilities of AI systems. In a 2024 paper, legendary AI researchers Yoshua Bengio and Geoffrey Hinton (along with several others) warned that humanity could “irreversibly lose control of autonomous AI systems” leading to the “marginalization or extinction of humanity.”
The emergence of sophisticated AI agents over the last year has intensified these concerns. Autonomous AI systems are no longer a theoretical possibility—they exist now and they are increasingly being used for real work.
In the AI safety community, discussions of AI risk typically focus on “alignment.” It’s taken for granted that we will cede more and more power to AI agents, and the debate focuses on how to ensure our AI overlords turn out to be benevolent.
In my view, a more promising approach is to just not cede that much power to AI agents in the first place. We can have AI agents perform routine tasks under the supervision of humans who make higher-level strategic decisions.
But AI safety advocates argue that this is unrealistic.
“As autonomous AI systems increasingly become faster and more cost-effective than human workers, a dilemma emerges,” Bengio and Hinton wrote in their 2024 paper. “Companies, governments, and militaries might be forced to deploy AI systems widely and cut back on expensive human verification of AI decisions, or risk being outcompeted.”
I think this reflects a failure of imagination.
Organizations will not face a stark choice between turning control over to AI systems or missing out on the benefits of AI. If they’re smart about it, they can have the best of both worlds.
After all, this is not remotely a new problem for humanity. People delegate tasks to other people all the time. When we do, there is always a risk of misalignment—the person we hire might pursue their own goals at our expense. But human societies have developed a rich menu of techniques for monitoring and supervision. Those techniques are neither costless nor flawless, but they work well enough that most of us feel comfortable hiring others to do work on our behalf.
Supervising an AI agent poses somewhat different challenges than supervising another human being, but not radically different ones. Many of the techniques we use to supervise other human beings will also work with AI agents. Some techniques will actually work better when applied to AI agents.
My claim is not that AI agents will never cause harm. Some organizations will of course fail to adequately supervise their AI agents. Adapting to AI agents will require the same kind of trial-and-error learning process as any other new technology. But there’s no reason to think delegating work to AI agents puts us on a slippery slope to an AI takeover.
How to control an AI agent
I thought about this question a lot as I was working on this article about coding agents. I was testing out Claude Code, an agent that works by executing commands on my local machine. In principle, a misconfigured agent like this could cause a wide range of problems, from deleting important files to installing malware.
To help prevent this kind of harm, Claude Code asked for permission before taking potentially harmful actions. But I didn’t find this method for supervising Claude Code to be all that effective.
When Claude Code asked me for permission to run a command, I often didn’t understand what the agent wanted to do or why. And it quickly got annoying to approve commands over and over again. So I started giving Claude Code blanket permission to execute many common commands.
This is precisely the dilemma Bengio and Hinton warned about. Claude Code doesn’t add much value if I have to constantly micromanage its decisions; it becomes more useful with a longer leash. Yet a longer leash could mean more harm if it malfunctions or misbehaves.
Doomers see this dilemma becoming more and more acute as AI agents get more powerful. In a couple of years, AI agents might be able to do hours or even days of human work in a few minutes. But (the doomers say) we will only be able to unlock this value if we take slow-witted humans out of the loop. So over time, AI agents will become more and more autonomous, and humans will become less and less able to control—or even understand—what they are doing.
Fortunately, step-by-step approvals aren’t the only possible strategy for overseeing AI agents. Claude Code provides another important safeguard: the agent only has access to files in the specific directory where it is supposed to be working. This means that Claude Code can’t muck around with my system files or look at my emails. It also means that it’s relatively safe to automatically approve commands that involve reading and editing files, since Claude only has access to the files it’s supposed to be editing.
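To make the directory-scoping idea concrete, here is a minimal sketch (in Python, with the workspace path and function names invented for illustration) of how an agent harness might gate file actions: reads and edits are auto-approved only when the requested path resolves inside the agent’s workspace, and everything else falls back to an explicit human prompt. This is a generic illustration of the least-privilege idea, not Claude Code’s actual implementation.

```python
from pathlib import Path

# Hypothetical workspace root: the only directory this agent may touch.
WORKSPACE = Path("/home/me/project").resolve()

def is_inside_workspace(requested: str) -> bool:
    """True only if the requested path resolves inside the workspace.
    resolve() follows symlinks and normalizes "..", so a path like
    "../../etc/passwd" can't escape the sandbox."""
    target = Path(requested).resolve()
    return target == WORKSPACE or WORKSPACE in target.parents

def approve_action(action: str, path: str) -> bool:
    """Auto-approve reads and edits confined to the workspace;
    anything else requires an explicit human yes."""
    if action in {"read", "edit"} and is_inside_workspace(path):
        return True
    answer = input(f"Allow agent to {action} {path}? [y/N] ")
    return answer.strip().lower() == "y"
```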
Other coding agents, including OpenAI’s Codex and Google’s Jules, use a more powerful version of this same technique: each instance of the agent runs in a separate cloud-based virtual machine. This means users don’t need to review or approve individual actions because the agent can’t affect the world beyond its sandbox.
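Here’s a rough sketch of what that kind of isolation can look like in practice: one agent run per disposable container, with no network access and only the task’s directory mounted. The “coding-agent” image name and its command-line interface are assumptions for illustration; OpenAI and Google haven’t published the details of their sandboxes.

```python
import subprocess

def run_agent_in_sandbox(task_dir: str, prompt: str) -> subprocess.CompletedProcess:
    """Run a single agent task in a disposable container.
    The "coding-agent" image and its prompt argument are hypothetical."""
    return subprocess.run(
        [
            "docker", "run",
            "--rm",               # throw the container away when the task ends
            "--network", "none",  # no internet, no internal services
            "--memory", "2g",     # cap resources so a runaway agent can't hog the host
            "--cpus", "2",
            "-v", f"{task_dir}:/workspace",  # only this task's files are visible
            "-w", "/workspace",
            "coding-agent",       # hypothetical agent image
            prompt,
        ],
        capture_output=True,
        text=True,
    )
```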
This is an example of a general security principle that organizations use constantly with both humans and software: the principle of least privilege. When you start a job at a company, you might be given a key that only unlocks your own office, not those of your co-workers. Your login credentials will give you access to files you need to do your job, but it probably won’t give you access to every document across the company.
It’s hard to apply this principle very strictly to people because human workers often perform many different tasks over the course of a workweek. Anticipating exactly which resources a worker will need is difficult, so overly strict rules can harm productivity.
It’s easier to rigorously limit the privileges of AI agents because you can create a separate copy of the agent for each task it is supposed to work on. You can put each instance into a separate virtual sandbox that provides access to exactly the information and resources the agent needs. In most cases, this won’t include any real user data.
And once you’ve set up a sandbox environment like this, you don’t need to closely monitor the agent’s actions. You can just wait for it to finish the task and see if you like the results. If not, just delete that copy of the agent and try again with a different prompt.
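Putting those pieces together, the workflow can be as simple as a loop: spin up a fresh sandboxed run, let the agent finish, check the result, and retry with a revised prompt if it isn’t good enough. This sketch reuses the run_agent_in_sandbox helper from above; the review step here just asks a human, though it could equally run a test suite.

```python
def review_passes(task_dir: str) -> bool:
    """Simplest possible review: ask a human whether the result looks right.
    In practice this could run an automated test suite instead (or as well)."""
    return input(f"Accept the changes in {task_dir}? [y/N] ").strip().lower() == "y"

def run_task_with_retries(task_dir: str, prompts: list[str]) -> bool:
    """Try each prompt in a fresh sandboxed agent run until one is accepted."""
    for prompt in prompts:
        result = run_agent_in_sandbox(task_dir, prompt)  # from the sketch above
        if result.returncode == 0 and review_passes(task_dir):
            return True
        # Otherwise discard this attempt (e.g. reset the working directory)
        # and try again with a more specific prompt.
    return False
```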
Review changes before deploying them
Of course, once a coding agent has written some code, we want to use that code in a real-world setting. This creates additional risks. Fortunately, many organizations already have rigorous processes for evaluating code and deploying it safely. These processes can include the following steps:
Every change to a company’s code is tracked in a version control system such as Git. This means that harmful edits can be rolled back quickly and easily.
A suite of tests automatically verifies that new code meets the organization’s requirements. If these tests are written well, buggy updates will often be detected and rolled back before they are pushed into production.
Proposed changes are reviewed by an experienced developer, who can reject a change or request modifications.
Code is pushed out gradually. Employees or volunteer beta testers might get access before the general public. Or the code might initially get pushed out to one percent of users to see if it causes any problems. If this initial rollout goes well, the new version can be gradually released to more users (see the sketch after this list).
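As promised above, the gradual-rollout step is often implemented as a deterministic percentage bucket, along these lines. This is a minimal sketch; real feature-flag systems add targeting rules, kill switches, and metrics, and the feature name here is made up.

```python
import hashlib

def in_rollout(user_id: str, feature: str, rollout_percent: int) -> bool:
    """Deterministically place each user in a bucket from 0-99.
    The same users stay in the rollout as the percentage grows,
    which makes problems easier to track down."""
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < rollout_percent

# Start at 1%, then widen to 10%, 50%, and 100% if the error rate stays flat.
if in_rollout(user_id="u-12345", feature="new-checkout-flow", rollout_percent=1):
    serve_new_version = True   # this user gets the freshly deployed code
else:
    serve_new_version = False  # everyone else stays on the old version
```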
These processes were not invented with AI agents in mind. They were designed to catch mistakes by human programmers. But it turns out that reviewing the work of an AI agent isn’t that different from reviewing the work of a human programmer. And so many organizations already have infrastructure that will help them keep AI coding agents under control.
Of course, not all organizations take all of the precautions I listed above. When a company is deciding how rigorous to make its code review process, it is fundamentally making a cost-benefit tradeoff. On the one hand, code reviews take up valuable programmer time. On the other hand, pushing buggy code into production can be very costly.
And this tradeoff looks different for different companies. A video game startup might skip formal code review altogether. A large hospital system, on the other hand, might require months of testing and review before pushing out a software update.
These tradeoffs will look different once companies have access to powerful coding agents. For example, coding agents are good at writing tests, so it will be a no-brainer for companies to expand their automated testing suites (or create them if they don’t exist yet). AI agents should also create opportunities for new types of tests—for example, an AI agent might play the role of a user and do end-to-end testing of new software inside a virtual machine.
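Here is a rough sketch of what an agent-driven end-to-end test might look like. The ask_agent_to_browse helper is hypothetical: in practice it would wrap whatever browser-automation or computer-use tool the agent has access to, pointed at a staging build rather than production.

```python
def ask_agent_to_browse(goal: str, url: str) -> dict:
    """Hypothetical helper: give the agent a natural-language goal, let it
    drive the app inside a sandboxed browser, and return a structured
    report of what happened. Stubbed out here."""
    raise NotImplementedError("wire this up to your agent framework of choice")

def test_signup_flow_end_to_end():
    report = ask_agent_to_browse(
        goal="Create a new account, verify the email, and log in.",
        url="http://localhost:8000",  # a staging build, never production
    )
    assert report["logged_in"] is True
    assert report["errors"] == []
```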
As AI agents take over routine coding tasks, human programmers can spend more time doing code reviews. The result could be higher productivity and a more rigorous approach to releasing software.
This works in the real world too
Because software is purely digital, software engineers tend to have unusually formal and rigorous review processes. But people take similar approaches to other high-stakes decisions:
Before a large company introduces a new product, executives will typically produce a series of memos describing various aspects of the product rollout, from manufacturing to marketing.
Before a company builds a new factory, architects and engineers will produce detailed blueprints and technical drawings that will be shared with interested parties inside the company as well as contractors and government officials.
Before a lawyer files an important motion, she may ask for feedback from colleagues with relevant expertise, then ask the client to review it.
Before a medical scientist begins a clinical study, he will typically write detailed project proposals to share with funders, colleagues, and an Institutional Review Board.
In short, it’s very common for organizations to break high-stakes decisions into three steps: writing a proposal, reviewing the proposal, and then executing. The goal is to help relevant stakeholders understand the proposal well enough to provide meaningful feedback before it is put into practice.
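In software terms, this pattern is just a gate: nothing gets executed until the required reviewers have signed off. Here is a minimal sketch, with class and field names invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Proposal:
    """Minimal propose-review-execute gate: execution is blocked until
    every required human reviewer has signed off."""
    title: str
    body: str                       # could be drafted by an AI agent
    required_reviewers: set[str]
    approvals: set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        if reviewer in self.required_reviewers:
            self.approvals.add(reviewer)

    def ready_to_execute(self) -> bool:
        return self.required_reviewers <= self.approvals
```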
There’s no reason to expect this basic decision-making pattern to change dramatically in a world of powerful AI agents. It’s a safe bet that AI agents will be increasingly involved in creating proposals—drafting blueprints, legal briefs, marketing materials, and so forth. AI agents can help make these proposals more detailed and rigorous. And many stakeholders will use AI agents to help them evaluate proposals during the review phase.
But people involved in high-stakes decisions are not going to stop expecting detailed proposals they can read and critique before the actual decision gets made. They are going to demand that these proposals be comprehensible to human beings.
Moreover, the people responsible for signing off on high-stakes decisions tend not to be in a big hurry. If you talk to software engineers at big banks or health care providers, they’ll tell you that they could be a lot more productive if their organizations were less bureaucratic and risk-averse. But decision-makers at these organizations have good reasons to be cautious. Often they face serious financial, professional, and legal consequences if they sign off on a bad decision. And they know they won’t get fired for holding too many meetings or asking for too much documentation.
And this seems relevant to Bengio and Hinton’s argument that organizations will “risk being outcompeted” if they don’t “cut back on expensive human verification of AI decisions.” It’s already the case that many organizations could be dramatically more productive if they cut back on oversight and bureaucracy. But in many industries—especially high-stakes industries like finance, health care, pharmaceuticals, aviation, car manufacturing, and so forth—the upside to greater speed and efficiency just isn’t that large relative to the regulatory and reputational downsides of making a serious mistake.
I could see the emergence of AI agents nudging some of these organizations toward more risk-taking at the margin. But it simply isn’t plausible that competitive pressures are going to force them to abandon human review altogether. Nor are decision makers going to accept the excuse that because a superhuman AI wrote a proposal, it’s too sophisticated for humans to understand.
A big reason I don’t expect powerful AI agents to dramatically change how organizations make decisions is that the third phase of the propose-review-execute process almost always happens at the speed of the physical world.
Maybe in a few years we’ll have AI agents that can create a perfect blueprint for a new factory in five minutes. But actually building the factory is still going to take months, if not years. So spending a week or even a month reviewing the proposal isn’t going to make a big difference for when the factory opens. On the other hand, it might allow the company’s leadership to improve the design in ways that will pay dividends for years into the future.
But isn’t it possible we’ll reach a point where AI agents are so smart that humans can never improve on the designs AI agents come up with? No, because good decision-making isn’t just about intelligence; it’s also about context and values. Often people are able to give better, more detailed feedback once they see a concrete proposal.
Once a company’s CEO sees a blueprint for a new factory, she might realize that she forgot to include an important design goal in her original instructions to the AI system. Other interested stakeholders—from contractors to local government officials—might similarly have helpful feedback. This feedback may be based on private information (for example, private conversations with customers, contractors, or policymakers) that wasn’t known to the person who drew up the initial proposal.
So no matter how smart an AI agent is, it’s unlikely to get every detail right on the first try, which means it will always make sense to seek feedback from interested humans before putting a plan into action.
It doesn’t matter (very much) if AI agents are capable of scheming
In recent months there has been a lot of attention paid to studies showing that AI agents are capable of lying to users and pursuing hidden agendas. Perhaps the best known of these studies was published by Apollo Research in December:
“Our results show that o1, Claude 3.5 Sonnet, Claude 3 Opus, Gemini 1.5 Pro, and Llama 3.1 405B all demonstrate in-context scheming capabilities. They recognize scheming as a viable strategy and readily engage in such behavior. For example, models strategically introduce subtle mistakes into their responses, attempt to disable their oversight mechanisms, and even exfiltrate what they believe to be their model weights to external servers.”
This study did not find that it was common for leading models to deceive their users in real-world applications. Instead, researchers found they could get models to deceive their users in contrived situations where the models had explicitly conflicting goals.
Here’s what they did: they gave a model one goal in its system prompt (for example, protect the environment) and a different goal in the user prompt (for example, maximize short-term profits). They found that when models were placed into this kind of awkward situation, they would sometimes ignore the user’s instructions and even lie to the user about what they were doing.
To be clear, I don’t mean this as a criticism of the research. I think Apollo did interesting work and exposed an important fact about today’s AI models. I don’t agree with those who say these situations are so contrived that we should dismiss the results.
At the same time, if we’re thinking about the real-world consequences of these findings, it’s important to remember that human societies have lots of experience dealing with human beings with a capacity for scheming and deception. It’s not uncommon for people in positions of authority to make decisions for self-interested reasons. This is why organizations have a variety of oversight mechanisms to make sure decisions are made in the best interests of the company, not in the best interests of the person making the decision.
So I don’t mean to dismiss scheming AI agents as a potential problem. It’s just a more familiar and manageable problem than many doomers think. Society already has robust mechanisms to limit the damage that can be done by human decision makers with ulterior motives. Variants of those same mechanisms should work just fine to limit the damage that can be done by AI agents with ulterior motives.
To be clear, the scenario I’m critiquing here—AI gradually gaining power due to increasing delegation from humans—is not the only one that worries AI safety advocates. Others include AI agents inventing (or helping a rogue human to invent) novel viruses and a “fast takeoff” scenario where a single AI agent rapidly increases its own intelligence and becomes more powerful than the rest of humanity combined.
I think biological threats are worth taking seriously and might justify locking down the physical world—for example, increasing surveillance and regulation of labs with the ability to synthesize new viruses. I’m not as concerned about the second scenario because I don’t really believe in fast takeoffs or superintelligence.