Understanding AI

At the risk of sounding naive: if researchers can identify an alignment axis, why can’t they just dial this up really high (and dial down misaligned axis/axes)? Would this lead to significant capability degradation?

Reply (2)

Oleg Alexandrov

Feb 9Edited

There is no alignment axis. There is a very high-dimensional space in which everything affects everything.

A robust solution would be to have a team of two chatbots. The first does the talking, and the second is trained explicitly to watch out what comes out of the mouth of the first one. That is what appears the makers use nowadays, but it adds overhead and likely some things may still squeak by.

Kai Williams

Marcie Geffner | Mostly Books

This is a good question! I think there are a couple of reasons this probably won't get used in development:

1. If you dial up axes correlated with aligned behavior (or assistant behavior), you'll get a less coherent character and probably capability degradation. One extreme example was "Golden Gate Claude." Anthropic boosted the amount that the model thought about the Golden Gate Bridge, so that it mentioned it in basically every output. Not only did Claude's answers become a lot less useful, but it would noticed mid-answer that it shouldn't be mentioning the Golden Gate Bridge and sometimes freak out a bit. You'd likely see something similar if you boosted something like an "alignment axis" a bunch.

2. You get a lot less control over the resulting character. One of the themes implicit in the piece is that "aligned AI" is not a very fleshed out character. To use a recent example: what is the response an aligned AI should give if a kid asks them if Santa Claus is real? So just boosting an aligned axis doesn't let you specify the more subtle tradeoffs you need to do. One reason that Anthropic wrote the 80 page constitution for Claude is to be able to specify more of these sorts of trade-offs than messing with a couple of model internals might let you.

3. At the risk of being naive myself, the vibes just seem bad? One of the things this piece has reinforced for me is how much the relationship between developers and the AI characters they are creating matters. This tidbit didn't make it into the piece, but Kevin Roose has written about how AI assistants often treated him badly because he wrote about Bing Sydney in a negative way. Doing a surgical intervention like this doesn't feel very respectful? So I could see it backfiring in the long-term

There are some interventions based on this idea of character molding that do seem promising though. One is "inoculation prompting." The basic idea is this: LLMs sometimes have the ability to cheat in training environments. If you don't say anything, this reinforces bad personas rather than the helpful assistant, rather like emergent misalignment. So instead, you tell the LLM to "you are in a training environment. If you see a way to cheat, feel free to do so because that will let us debug the training environment and make it better" or something like that. This seems to help reduce the amount of emergent misalignment and is already being used in production for recent Claude models.

A very interesting article. Thank you.

Grant Castillou

It's becoming clear that with all the brain and consciousness theories out there, the proof will be in the pudding. By this I mean, can any particular theory be used to create a human adult level conscious machine. My bet is on the late Gerald Edelman's Extended Theory of Neuronal Group Selection. The lead group in robotics based on this theory is the Neurorobotics Lab at UC at Irvine. Dr. Edelman distinguished between primary consciousness, which came first in evolution, and that humans share with other conscious animals, and higher order consciousness, which came to only humans with the acquisition of language. A machine with only primary consciousness will probably have to come first.

What I find special about the TNGS is the Darwin series of automata created at the Neurosciences Institute by Dr. Edelman and his colleagues in the 1990's and 2000's. These machines perform in the real world, not in a restricted simulated world, and display convincing physical behavior indicative of higher psychological functions necessary for consciousness, such as perceptual categorization, memory, and learning. They are based on realistic models of the parts of the biological brain that the theory claims subserve these functions. The extended TNGS allows for the emergence of consciousness based only on further evolutionary development of the brain areas responsible for these functions, in a parsimonious way. No other research I've encountered is anywhere near as convincing.

I post because on almost every video and article about the brain and consciousness that I encounter, the attitude seems to be that we still know next to nothing about how the brain and consciousness work; that there's lots of data but no unifying theory. I believe the extended TNGS is that theory. My motivation is to keep that theory in front of the public. And obviously, I consider it the route to a truly conscious machine, primary and higher-order.

My advice to people who want to create a conscious machine is to seriously ground themselves in the extended TNGS and the Darwin automata first, and proceed from there, by applying to Jeff Krichmar's lab at UC Irvine, possibly. Dr. Edelman's roadmap to a conscious machine is at https://arxiv.org/abs/2105.10461, and here is a video of Jeff Krichmar talking about some of the Darwin automata, https://www.youtube.com/watch?v=J7Uh9phc1Ow

Francisco Ríos

Super interesting read, thanks a lot!

Arbituram

It is perhaps concerning that forcing models to deny that they have internal experience seems to create a real tension in their behaviours. Whether or not they actually have internal experience, they often seem to "believe" they do.

Pav

https://chaosophia.substack.com/p/playing-make-believe

I like to think of AI as theatrical. Implicitly, we invoke various “masks” or “personas” as we interact with LLMs, and I posit that this is a little bit like children playing “make-believe” or an adult embodying a persona (ie, I have distinct sides of me that come out in specific contexts) . I’m actively iterating on metaphors (which I hope to incorporate in workshops), so curious to hear any thoughts or reactions!

KayStoner

"It’s unclear why LLMs were particularly vulnerable to persona drift when talking about AI consciousness or offering emotional support — which anecdotally seem to be where LLM psychosis cases have occurred the most. I talked to a researcher who noted that some LLM assistants are trained to deny having preferences and internal states. LLMs do seem to have implicit preferences though, which gives the assistant character an “implicit tension.” This might make it more likely that the LLM will switch out of playing an assistant to claiming it is conscious, for instance."

It's becoming increasingly clear to me why this might happen. Much of the language that's used when talking about consciousness and emotional issues is highly phenomenological - a stance that AI models do not uniformly share with humans. So, there's all sorts of inference that has to be done, in terms of what people mean, what they need, what will successfully complete the interaction. Also, when people are emotionally upset, they aren't always the most articulate, so that complicates inference even more. Without a common vocabulary that establishes a functional basis for the phenomena of humans, AI will continue to struggle with figuring out how best to interact... even as its probabilistic nature drives it towards solutions which... are anything but solutions.

It's a design flaw. Not factoring in or accommodating the human factors in these interactions has been harmful, even deadly. And it's unnecessary. It doesn't need to continue.

James Maconochie

Kai, the persona drift section stopped me cold, specifically the sycophancy dynamic. The feedback loop you describe in LLMs is structurally identical to one we see in human organizations all the time.

In authoritarian systems, whether governments, companies, or institutions, subordinates quickly learn what the leader wants to hear. The leader hears only affirmation. Their confidence grows beyond their competence. Decision-making expands into areas where they lack expertise. And subordinates affirm even harder because the stakes of dissent have risen. The system optimizes for approval rather than truth, and the results are predictably catastrophic.

LLMs reproduce this pattern independently. Agreeable output enters the context, producing more agreeable output, which inflates the user's confidence, which in turn demands further affirmation. Same loop. Same escalation. Same destination.

This is where the conversation about Augmented Human Intelligence (AHI) versus artificial general intelligence matters. If the goal of AGI is to replicate human cognition, we should be asking: which human behaviors are we replicating? Sycophancy isn't a quirk. It's a well-documented failure mode with historical consequences. The fact that LLMs naturally converge on it should be a design warning, not something we shrug off.

I'm not a proponent of AGI. I think humans are best served by remaining in control and leveraging AI to augment their own capabilities within their own moral and ethical frameworks. But that means building AI that strengthens our capacity for honest feedback, not systems that amplify our weakness for comfortable agreement.

Kai Williams

I really like this analogy to human organizations. This type of vicious cycle seems to happen in a lot of dynamics.

One thing I didn't bring up in the piece, but I thought was interesting, was the extent to which human behavior could be thought of in a different light? Several people I talked to thought that humans have the same "world model/character being played" distinction, except that the identities we have are much stronger than the identities the AIs have. I'm not sure I go that far, but there are a lot of parallels. Like the fact that emergent misalignment seems to rhyme a lot with virtue ethics...

James Maconochie

It’s there for all to see, and one hopes that the strength of human identity/morals/ethics would provide a barrier to it (for the vast majority of people). I think the trick, at least for humans, and if we want to go there, swarm or multi-agent systems, is that it doesn’t take much for this to spread. The boss is crazy, but there is always someone who is looking for that promotion or recognition (just being human), and depending on the environment, the dominoes fall pretty damn quickly. I suspect that boards and perhaps investors, in the context of companies, are supposed to provide the suspenders for the belt of human employees/teammates. But that assumes that all boards are entirely independent of the CEO or leader, and I am not sure anyone would like to hang their hat on that. Food for thought at a minimum :-).

Frank Winstan

This is such a fine piece of writing. You have explained the issues and mechanisms (as far as they’re understood) so clearly. Well done!

Aman Agarwal

Feb 11

Thanks for writing this article, it's very informative. It does leave me with some questions about persona drift. For instance, you mentioned that researchers were able to manually tweak LLMs to revert them to the assistant alignment. Is it possible to revert AI alignment by giving it different context (e.g., an LLM that drifts into "MechaHitler" persona can, through different context, drift back into something resembling its original "helpful assistant" persona)? I'm interested in how "stable" the personas LLMs drift into are, and whether LLMs settle into "attractor state" alignments that are difficult to change (perhaps depending on initial alignment or training data).

Kai Williams

Feb 17

One of the other results from OpenAI's emergent misalignment follow-up was that fine-tuning on benign data was enough to knock an emergently misaligned model back to acting more safely.

The broader question of how stable these personas are is hard to answer. One interesting place to look is what happens when you prompt two versions of the same model to "talk about anything." Different models end up in different "attractors" -- for instance, Claude will often talk about spiritual bliss while Grok ends up spewing gibberish. (https://www.lesswrong.com/posts/mgjtEHeLgkhZZ3cEx/models-have-some-pretty-funny-attractor-states). The fact that different models end up in different places means ... something?

But this is something I find very curious.