13 Comments
Rob Nelson

I think about this in the context of using AI as a teaching aid, and it seems clear we’re far from having one serve as a source of truth. I wonder if the improved flow will help it serve as a way for a young learner, who understands reality, to practice their knowledge of numbers by telling it that it is wrong. Not saying that is a good use case for the current model…just that I can imagine a very limited version of a model like this performing the educational function of playing a pretend idiot.

Daniel Nest

Oh man, that was both illuminating and thoroughly entertaining, thanks for sharing these experiments!

For what it's worth, I recently tested the "Stream Realtime" options with Gemini in Google AI Studio (sharing the screen or sharing your camera feed). It was very clear that it wasn't following my video stream in real time but rather taking standalone screenshots at regular intervals and inferring the world from those.

If that's also the case with ChatGPT, that'd be another compounding factor at play here, in addition to the "committing to the wrong answer" and "prioritizing the user's word tokens over vision tokens" phenomena.

Timothy B. Lee

Yeah, I don't know if OpenAI has said exactly how the video is fed into the model but I wouldn't be surprised if it was taking a screenshot once a second or something.
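(To make the guess concrete: a rough, purely hypothetical sketch of "a screenshot once a second" might look like the snippet below. `describe_frame` is a made-up placeholder for whatever vision-model call sits behind the assistant; OpenAI hasn't documented the actual pipeline.)

```python
# Hypothetical sketch: sample one frame per second from a camera and hand
# each frame to a vision model. This illustrates the "screenshot once a
# second" guess only; it is not OpenAI's documented implementation.
import time
import cv2  # pip install opencv-python


def describe_frame(jpeg_bytes: bytes) -> str:
    """Stand-in for whatever vision-model call the assistant actually makes."""
    return f"(model would describe a {len(jpeg_bytes)}-byte frame here)"


cap = cv2.VideoCapture(0)  # default camera
try:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        encoded, jpeg = cv2.imencode(".jpg", frame)
        if encoded:
            print(describe_frame(jpeg.tobytes()))
        time.sleep(1)  # sample roughly once per second
finally:
    cap.release()
```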

Jay Hill

From the video, it also seemed like it could have been checking the video when prompted. I wonder what would happen if you asked it to identify the book on the shelf again, but kept asking it, "Do you see it now?" as you panned.

Michael Anderson

I think solving these issues is going to be really difficult so long as you are layering on the audio/video components after the initial text-based unsupervised pre-training.

Alex Newkirk

Might improved vectorization techniques allow unified multimodal pretraining?

Michael Anderson

Possibly, but I don't think the issue is the embedding process. As I understand it, the unsupervised pre-training is very much tied to the next-token prediction process, and I'm not sure what unified multimodal data set you could create for that process. Maybe YouTube clips with synced image and text/audio data would be useful. But that set of data is MUCH smaller than the collective internet used to train the frontier models of today.

Kendra Vant

Brilliant read, thanks for taking the time to test so thoroughly, Timothy.

Marcus Seldon

One major problem with these models is not just gullibility, but sycophancy as well. I've found that when I assert something is the case, or even suggest I might think it is (as you did when you pointed to "24" being the opposite side), models are very likely to simply agree with me.

I'd be curious to see if this is also the case for CoT models, or if they still suffer from this bias.

Frank Winstan

Great work, Tim! I’m about to post a response to some rather dubious claims in another Substack that the author obtained clear evidence that Claude is conscious and self-aware through having it "meditate." Poor methodology and highly questionable inferences. By contrast, it is such a pleasure to read someone like you who is a rigorous thinker doing meaningful research and writing to advance our understanding of LLMs. Keep it up (I’m cutting back on cappuccinos so that I can get a paid subscription :)

Timothy B. Lee

Thank you, Frank!

Watch

Is there any chance that a model that operates based on predicting the right “next word,” as you have described so well in other columns, is not well suited to visual image processing?

Timothy B. Lee

Yes, I absolutely think that's part of it. I'm planning to write a piece in the future about why these vision language models are so bad at counting and other visual reasoning tasks. However, I thought it was interesting that adding realtime capabilities seems to make ChatGPT significantly worse at certain vision tasks (like counting) than previous versions of GPT-4.
