17 Comments
Nathanael Ries 🇺🇸

You need to stop using ChatGPT.

It's not currently the best model.

Go ahead and spend some money and use a real AI agent that has memory, like Claude 4.7.

If you legitimately give it articles you actually wrote, it can start recognizing when you upload things that you did not write and that do not belong to you.

Courtney

Yeah you sound like a shill. Nearly everybody who uses LLMs on a serious basis understands that every model is useful for different things and no one straight up writes anything off.

Claude Code is still pretty far behind Codex in several ways, most notably long-running tasks. They even admit that in their own writing; that's why it's such a high priority for the team.

If you think one model is drastically worse than the other, it's probably because of your prompts and your lack of context or intent. Actually, you can prove that if you still have a go-to prompt or a list of prompts that you've bookmarked, because prompt engineering is not the way to get the best results. Even Anthropic has several guides telling you that exact same thing.

Timothy B. Lee

Hi Nathanael! ChatGPT recognized me so I don't quite understand your comment. Probably if I tried Claude it would also recognize me. I'm not sure what that would prove.

An interesting question is whether I could teach ChatGPT or Claude to recognize new authors whose work isn't in the training set of the underlying model. I am skeptical of this for the reasons laid out in the article, but it's a hard question to test since I don't have a large trove of unpublished articles by various writers.

Kevin

From my experience working with scientists, extracting deep insights is like 0.1% of the job. Many professional scientists aren't expected to extract deep insights at all - the insights are often obvious given the data, and the vast majority of the job is data collection. Or they're working in a lab and it's someone else's job to extract insights.

Or consider reproducing research papers. Checking whether a research paper is reproducible is valid science, right? Arguably we need more of that. But there isn't even supposed to be any deep insight there. You still need to hire scientists for this, because so many of the practices in a field are only known by the scientists working in it.

I'm sure there are many scientists whose jobs cannot be replaced by AI. But the "science" part doesn't seem like the hardest part to replace; the AI is much worse at, for example, delivering PowerPoint slides asking for funding. A critical part of the PI job! A lot of entry-level scientists seem like they could have all of the work they would do five years ago entirely replaced with a modern AI.

Kenny Easwaran

There isn’t *supposed* to be any deep insight in replicating an experiment. But that’s only true if the initial author managed to right down all the tips and tricks needed to get the initial experiment working. You often have to fly out someone from the initial lab to show you where they had to kick the machine or duct tape something to get it working.

Kevin

That sounds perfect, I bet the AI could also fly someone out from the initial lab to ask them questions ;-)

Kenny Easwaran

This is the point Dwarkesh Patel made last year in talking about “continual learning”. And I’m arguing in a talk I’m giving these days that it’s actually very close to the point that Hubert Dreyfus was making about expert systems back in the 1980s (and about a lot of analytic philosophy). It’s true that a lot of intelligence can be reduced to knowledge that can be expressed as sentences in a language. But there are things you need to practice and optimize on and can’t express in words.

Modern LLMs do a great job of improving on expert systems by having a bottom layer that has trained and practiced. All the types of reinforcement learning they’re adding are doing more. But they don’t do any better on your own task than the instructions you can write down unless that task makes it into the reinforcement learning loop for the next model.

Russell Hawkins

Has your talk been recorded and posted anywhere?

James Maconochie

Tim, this is one of the cleanest popular-press articulations of the implicit-knowledge problem I've read; the temp-worker analogy in particular does real work.

One push: the reason a seasoned hunch is trustworthy isn't just that it's pattern-rich. It has carried a consequence. The practitioner has made calls, paid for the wrong ones, and adjusted. That loop gives the hunch its weight. An LLM mid-session has nothing in the loop that bears a cost, which means this isn't a context-window problem, it's a stake problem.

Extended the thought (and where it goes for governance) here: https://substack.com/@jammit1994/p-196711207

Oleg Alexandrov

"LLMs seem to lack a capacity for continual learning: the ability to recognize new patterns in — and form new hunches about — information they encounter at inference time."

We have two modes now. One is the online mode described in this article. The AI agent diligently reads and writes notes as it works. This is indeed proving to be quite revolutionary for Claude.

The second is to at some point take all the accumulated knowledge and retrain the model from scratch. This can make much deeper connections.

While I understand the issues raised in this article, it is not fully clear to me that humans genuinely do something totally different from these two modes.

At most, one could argue that humans have more granularity, so they can refresh their minds faster and without a total reset. That is surely smarter and more efficient. But is it fundamentally different?

Oleg Alexandrov

Maybe text itself is lossy, and verbalization is the root of the problem.

Sean Trott

I always start from the position that we know very little about human cognition and whether the best way to think about it is, in fact, in terms of information processing.

But assuming this is a helpful description of what humans are doing, it's in part an empirical and engineering question: how feasible and reliable is it to fine-tune a model on a new corpus (like all your chats) and have it glean the right insights (while not forgetting other crucial stuff, etc.)? I don't know; someone more familiar with fine-tuning than I am could probably answer that. Setting aside the computational demands, I assume part of the challenge is curating the continual-learning corpus: you don't necessarily want to learn from every new input. Maybe another LLM could do that? It's an interesting question.
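To make that concrete, here is a minimal sketch of the kind of fine-tuning being described: updating a small causal language model on a curated personal corpus. This is only an illustration under my own assumptions; the model name, the file `curated_corpus.txt`, and the hyperparameters are placeholders, and a real continual-learning setup would also need guards against forgetting (adapters, replay data, held-out evaluations).

```python
# Minimal fine-tuning sketch (assumptions: Hugging Face transformers + PyTorch,
# a placeholder base model, and a pre-curated text file of documents).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # stand-in for whatever base model you would actually tune
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical curated corpus: one document per line, already filtered
# (perhaps by another LLM) so the model isn't updated on every stray input.
with open("curated_corpus.txt") as f:
    documents = [line.strip() for line in f if line.strip()]

for doc in documents:
    batch = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
    # Standard causal-LM objective: predict each token from the ones before it.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Whether a loop like this actually makes the model glean the right insights, rather than just memorize the corpus, is exactly the empirical question raised above.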

Oleg Alexandrov

We will see. We have barely started to capitalize on what we've got so far, and the success of Claude was far from assured given where we were last year. There's so much work within reach of current and near-term AI.

Simple John

"translate them to English, Python, or any other explicit form"

The word "explicit" crystallized for me why talking about LLMs in relation to testable languages (like coding languages) and hand-wavy languages (like most human languages) in the same breath is inherently misleading.

LLM results that can be expressed in Python and the like are explicit (definite in form and content). English and other natural languages are only explicit when stating names (Charlie Chaplin) or using pointing words (this sentence).

Tim: As much as you did deliver information useful to me throughout your post, wrapping it in an apparent acceptance that the two language types have the same scope will encourage some lazy folks to believe they behave the same in terms of delivering something actionable. I know this because I am lazy as often as not.

Sam Tobin-Hochstadt

It's mostly beside the point of this post, but I still want a definition of an "automated AI scientist" that can be operationalized and that hasn't already happened.

Tim Tyler

Technically, continual learning is almost trivial: just keep applying backprop during inference. The problem is not really continual learning; it is that LLM learning itself is slow. So continual learning offers few benefits and has substantial costs; regular brain wipes are a safety feature.
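For what it's worth, here is a rough sketch of what "keep applying backprop during inference" could look like in practice: after each exchange, take one gradient step on the text the model just saw (sometimes called test-time or online updating). The model name, learning rate, and helper function are my own placeholders, not anyone's actual setup.

```python
# Sketch of online weight updates at inference time (assumptions: Hugging Face
# transformers + PyTorch, a small placeholder model, one update per exchange).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.SGD(model.parameters(), lr=1e-5)

def respond_and_update(prompt: str) -> str:
    # Ordinary inference: generate a reply to the prompt.
    model.eval()
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=50)
    reply = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # The "continual learning" part: one backprop step on the text just seen,
    # so the weights themselves shift rather than only the context window.
    model.train()
    loss = model(**inputs, labels=inputs["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return reply
```

The mechanics are simple; the point above is that each per-example update is weak and slow relative to what a context window buys you, and persisting every update carries its own costs.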

Tim Tyler

What about AI computer scientists? They can do their own experiments easily enough.