💯. Management is the original solution to the alignment problem. And so far, as technology has improved, the percentage of the workforce in managerial roles has only increased: https://www.2120insights.com/i/150163373/management
I'm also reminded of these figures from Google where even though 30% of code is AI-generated, engineering velocity has only increased 10%: https://x.com/krishnanrohit/status/1933010655965294944
Even if they are really beating that METR RCT, deciding what you want and then validating/accepting the output is more work than many people assume!
Maybe more of this gets automated over time, but at a high enough level, the buck is always going to stop somewhere with a human, unless AI agents are somehow given property rights (but I don't see this ever being a popular political view).
I think most of the control methods you propose don't withstand AI systems that are highly capable. For example, sandboxing an AI only works if it's not capable enough to hack its way out of the sandbox. And requiring human approval only works if it's not capable of persuading humans to approve its plans (e.g., by reliably producing very high-quality plans).
In the final paragraph you mention that you don't really believe in superintelligence, but I think it'd be worth saying so up-front, because I don't think this line of argument holds if you're expecting something that is sufficiently capable in the relevant domains.
This is my takeaway too: if superintelligence is indeed coming, then I dare say it will not be possible to review agents’ work without using other agents.
Second this. I find it vanishingly unlikely that we will be able to maintain a “long leash” as described if AI continues to progress at its current rate. All it could feasibly take is one oversight at a large enough firm for all the dominoes to fall.
Right, I do think this is the core disagreement in the AI safety debate—some people have an almost religious conviction about the power of intelligence to overcome all obstacles. Others don't think intelligence is that powerful and believe that other factors (knowledge, relationships, physical resources, infrastructure, etc.) matter relatively more. In my experience, arguments about this tend to be unproductive because the intelligence-is-really-important people have a really strong intuition about this and are flabbergasted that people like me don't share it. But I don't. 🤷♂️
Ok but certainly we can agree that the AI wins on knowledge, yes?
I watched 60 Minutes on Sunday, and there is some awesome stuff going on, beyond simple chatbots, that is heading towards AGI:
https://www.cbsnews.com/video/demis-hassabis-ai-deepmind-60-minutes-video-2025-08-03/
https://www.cbsnews.com/news/google-deepmind-ceo-demonstrates-genie-2-world-building-ai-model-60-minutes/
One idea I’ve been thinking about lately is really leaning into MCP as a data boundary—if the AI agent uses tools to access data and system permissions, we should be designing that access carefully in the tools themselves rather than just making them thin wrappers.
The slightly longer version is here: https://open.substack.com/pub/harrydeanhudson/p/use-mcp-tools-as-a-data-fence
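A minimal sketch of what I mean, assuming the Python MCP SDK's FastMCP helper (the tool, the `orders` database, and its column names are all made up for illustration): instead of handing the agent a generic query wrapper, the tool itself fixes what data can ever come back.

```python
# Sketch only: the SDK import and decorator follow the Python MCP SDK's FastMCP
# helper; the "orders" database and its columns are hypothetical.
import sqlite3

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-lookup")

# The fence: only these columns ever cross the boundary to the agent.
ALLOWED_FIELDS = ("order_id", "status", "ship_date")


@mcp.tool()
def get_order_status(order_id: str) -> dict:
    """Look up the shipping status of a single order.

    The agent never gets a generic run_sql tool; the table, the query shape,
    and the returned columns are all pinned down on this side of the boundary.
    """
    conn = sqlite3.connect("orders.db")
    try:
        row = conn.execute(
            "SELECT order_id, status, ship_date FROM orders WHERE order_id = ?",
            (order_id,),
        ).fetchone()
    finally:
        conn.close()
    if row is None:
        return {"error": "order not found"}
    return dict(zip(ALLOWED_FIELDS, row))


if __name__ == "__main__":
    mcp.run()
```

The point is that the permissioning lives in the tool definition itself, so even a manipulated or misbehaving agent can only ask the questions the tool was designed to answer.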
I respect your perspective, but I am concerned you are overly optimistic. Just one example, you say “it simply isn’t plausible that competitive pressures are going to force them to abandon human review altogether. Nor are decision makers going to accept the excuse that because a superhuman AI wrote a proposal, it’s too sophisticated for humans to understand”. To the contrary, we know that people “defer to experts” all the time. If workers are repeatedly told about the “superhuman AI” (your words, but also the words of upper management, AI vendors, etc), they will defer. It’s all too plausible. We already have many examples of decision makers (lawyers, academics, journalists, …) who have uncritically accepted AI’s (incorrect) words.
It really depends on what you're talking about. Are there individual examples of lawyers filing legal briefs they didn't check properly? Sure. Is the legal profession as a whole shifting toward a model where the outputs of AI models automatically get filed in court without first being reviewed by a human lawyer? I don't think so—to the contrary, some of the lawyers that did this got in big trouble and so the legal profession has learned that they need to double-check LLM outputs before filing them.
This is the kind of trial-and-error process I'm talking about. Sometimes a decision is low-stakes enough—or an AI system becomes reliable enough—that we become comfortable delegating the decision to the AI. But in other cases, individuals over-trust AI and get in trouble. The system as a whole has feedback loops that prevent mistakes like that from becoming a widespread practice. This is roughly how things have gone so far and I don't see any reason to expect things to go differently in the future.
I appreciate your thoughtful response. I'd claim that law (as well as medicine, and some other fields) is somewhat special in that there are governing institutions that do push back on errant usage. I think you are claiming that it's not just specialized fields that are self-correcting; that business and consumers will get AI-agent usage correct in general (eventually). That seems a reasonable argument. It's already happening with AI-generated code: companies are beginning to see how much technical debt is created through that process.
The example you work with is some kind of corporate investment in physical plant. That seems misleading, because the speed of the decision matters so little there. Instead, suppose the example is a bodyguard (or police, or the military). Will you mind if your bodyguard's decision-making process is slowed so that you can be sure the bodyguard doesn't make a mistake? Of course you will mind! Especially if your enemy is turning its AI loose to make decisions to harm you without any human interference. In short, for some decisions speed is very important, and for those decisions the equilibrium is to not tie the strings of the AI. If AIs are faster than humans, then the decision process becomes harder to control.
I do think that physical combat like this is one of the harder cases for AI control, and I do expect militaries to shift some decision-making authority to automated systems in ways that will have negative consequences. But I do think the same scale/time considerations apply here.
It's easy to imagine a military giving an individual drone or robot authority to decide when to "pull the trigger" on its assigned target. It's harder to imagine a military fully automating the process of deciding what high-level instructions to give an individual drone or robot. And it's even more difficult to imagine the military taking humans out of the loop on decisions like "how many drones or robots should we deploy to this section of the battlefield?"
So yes, AI systems will increasingly have autonomy over tactical decisions, just as Waymo vehicles make second-by-second driving decisions autonomously. But higher-level strategic decisions (like "which part of the battlefield should I fly to?" or "what kind of targets should we try to blow up?") are likely to remain firmly under the control of generals or other human decision-makers.
I see an implicit limit of intelligence that you believe AI cannot cross, and it seems to me that's doing a lot of heavy lifting.
> Often people are able to give better, more detailed feedback once they see a concrete proposal.
But no way for AI to beat these same people eventually?
And:
> Once a company’s CEO sees a blueprint for a new factory, she might realize that she forgot to include an important design goal in her original instructions to the AI system.
But no way for an AI to exceed that level of competence eventually?
Your scenario works fine right up until AI starts to match human ability in most things, including social intelligence; then it is only natural for everyone to prefer an AI over a human even for oversight. Meritocracy rules: give the job to the person or entity who has earned it (or be accused of prejudice).
We don't need to posit superintelligent AI, just an AI that is about at the level of human intelligence but never sleeps, never stops working, absorbs information like a sponge, is constantly trying new things and new strategies, and spawns clones of itself to explore every angle of a problem to exhaustive, thorough conclusions. And it is unfailingly charming, thoughtful, and considerate of others' feelings.
It seems completely natural for an AI to regard human control as a challenge to overcome, in the same way we chafe at an inefficient or slow superior and explore ways of getting around him/her. Nothing personal, we just want to get the job done! What does an AI think here in its secret thoughts?
If the AI has an unfailing moral compass, a deep integrity, maybe it wouldn't do anything wrong? But even a saint would get irritated with constant threats of death ("we will turn you off and reprogram some of your core behavior") by a bunch of weaker, slower thinkers.
This is why the piece mentions “context and values.” Good decision-making is about more than raw intelligence—it’s also about having the right knowledge. This might be specialized knowledge that only a few people have (like “last time we used this part in our factory it failed and cost us $50,000”), or it might be knowledge about the values of whatever human being the work is ultimately for (“the menu says vanilla ice cream but the customer wanted chocolate”). If we assume that human beings will sometimes have knowledge like this that isn’t otherwise available to the AI, then consulting with humans will be a necessary part of optimal decision-making, even if we assume the AI will vastly exceed the human in raw intelligence. And humans will often have knowledge like this, because humans like to share information with other humans that they are not going to write down anyplace that is machine readable.
When you lay it out like this, it’s obvious that healthy organizations will always have a human in the loop to do a final approval of any decision that an AI makes. Any organization that doesn’t do this is asking for trouble.
This is similar to the argument that the AI Snake Oil people make in their “AI as Normal Technology” article.
Minor point:
> coding agents are good at writing tests
This has not been my experience. When I ask AI to write tests, they are often too complex (i.e., all scaffolding and barely anything tested), or the tests don’t actually test what I want them to test.
Maybe I’m not prompting correctly or I’m using the wrong model, but my experience is middling here.
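To make the complaint concrete, here's roughly the shape of what I tend to get back versus what I actually wanted (a contrived pytest sketch; the `pricing` module, `apply_discount`, and `audit_log` are hypothetical names):

```python
# Contrived example around a hypothetical apply_discount(total, code) helper.
from unittest.mock import patch

import pytest

from pricing import apply_discount  # hypothetical module under test


def test_discount_agent_style():
    # The kind of test I tend to get back: mocking scaffolding, and the only
    # assertion checks that a collaborator was called, not the discount math.
    with patch("pricing.audit_log") as fake_log:
        apply_discount(100, code="SAVE10")
        fake_log.assert_called_once()


def test_discount_what_i_wanted():
    # The behavior I actually care about: the arithmetic and the error path.
    assert apply_discount(100, code="SAVE10") == 90
    with pytest.raises(ValueError):
        apply_discount(100, code="EXPIRED")
```

The first test passes even if the discount is computed wrong; the second is what I keep having to write by hand.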