ChatGPT Agent: a big improvement but still not very useful
OpenAI's latest computer-use agent still isn't reliable enough for important tasks.
Agents were a huge focus for the AI industry in the first half of 2025. OpenAI announced two major agentic products aimed at non-programmers—Deep Research (which I reviewed favorably in February) and Operator (which I covered critically in June).
Last Thursday, OpenAI launched a new product called ChatGPT Agent. It’s OpenAI’s attempt to combine the best characteristics of both products.
“These two approaches are actually deeply complementary,” said OpenAI researcher Casey Chu in the launch video for ChatGPT Agent. “Operator has some trouble reading super-long articles. It has to scroll. It takes a long time. But that’s something Deep Research is good at. Conversely, Deep Research isn’t as good at interacting with web pages—interactive elements, highly visual web pages—but that’s something that Operator excels at.”
“We trained the model to move between these capabilities with reinforcement learning,” said OpenAI researcher Zhiqing Sun during the same livestream. “This is the first model we’ve trained that has access to this unified toolbox: a text browser, a GUI browser, and a terminal, all in one virtual machine.”
“To guide its learning, we created hard tasks that require using all these tools,” Sun added. “This allows the model not only to learn how to use these tools, but also when to use which tools depending on the task at hand.”
Ordinary users don’t want to learn about the relative strengths and weaknesses of various products like Operator and Deep Research. They just want to ask ChatGPT a question and have it figure out the best way to answer it.
It’s a promising idea, but how well does it work in practice? On Friday, I asked ChatGPT Agent to perform four real-world tasks for me: buying groceries, purchasing a light bulb, planning an itinerary, and filtering a spreadsheet.
I found that ChatGPT Agent is dramatically better than its predecessor at grocery shopping. But it still made mistakes at this task. More broadly, the agent is nowhere close to the level of reliability required for me to really trust it.
And as a result, I doubt that this iteration of computer-use technology will get a lot of use: an agent that frequently does the wrong thing is often worse than useless.
ChatGPT Agent is better—but not good enough—at grocery shopping
Last month I asked OpenAI’s Operator (as well as Anthropic’s computer-use agent) to order groceries for me. Specifically, I gave it the following image and asked it to fill an online shopping cart with the items marked in red.
Operator did better than Anthropic’s computer-use agent on this task, but it still performed poorly. During my June test, Operator failed to transcribe some items accurately—reading “Bananas (5)” as “Bananas 15,” leaving out Cheerios, and including several extra items. Even after I corrected these mistakes, Operator only put 13 out of 16 items into my virtual shopping cart.
The new ChatGPT Agent did dramatically better. It transcribed 15 out of the 16 items, found my nearest store, and was able to add all 15 items to my shopping cart.
Still, ChatGPT Agent missed one item—onions. It also struggled with authentication. After grinding for six minutes, the chatbot told me: “Every item required signing into a Harris Teeter (Kroger) account in order to add it to the cart. When I clicked the Sign-In button, the site redirected to Kroger’s login portal, which my browsing environment blocked for security reasons.”
At this point ChatGPT Agent gave up.
The issue seems to have been an overactive safety monitor. Because the Harris Teeter grocery chain is owned by Kroger, the login form for Harris Teeter’s website is located at kroger.com. This triggered the following error message:

It looks like OpenAI placed a proxy server in front of ChatGPT Agent’s browser. This proxy apparently has access to the agent’s chat history and tries to detect if the agent is doing something the user didn’t ask for. This is a reasonable precaution given that people might try to trick an agent into revealing a user’s private data, sending the user’s funds to hackers, or engaging in other misbehavior. But in this case it blocked a legitimate request and prevented the agent from doing its job.
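OpenAI hasn't published how this monitor works, but the error message suggests it compares each navigation request against the conversation. Here is a minimal toy sketch, in Python, of the kind of domain-relevance check that could produce exactly this failure; the function name and logic are my own illustration, not OpenAI's implementation:

```python
from urllib.parse import urlparse

def is_request_relevant(chat_history: list[str], url: str) -> bool:
    """Toy relevance check (hypothetical): allow a navigation request
    only if the site's brand name appears somewhere in the chat."""
    domain = urlparse(url).netloc.lower()              # e.g. "www.kroger.com"
    brand = domain.removeprefix("www.").split(".")[0]  # e.g. "kroger"
    conversation = " ".join(chat_history).lower()
    return brand in conversation

# The Harris Teeter failure mode: the user named the store, but the
# login form lives on the parent company's domain, which the user
# never mentioned -- so a naive check blocks a legitimate request.
history = ["Fill a cart at harristeeter.com with the groceries in this photo."]
print(is_request_relevant(history, "https://www.harristeeter.com/cart"))  # True
print(is_request_relevant(history, "https://www.kroger.com/signin"))      # False: blocked
```

A check this crude would explain both the Kroger block and why simply saying "let me log into the Harris Teeter website" unblocked it: once the user's message mentions the destination, the request looks relevant again.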
Fortunately this was easy to fix: I just responded “let me log into the Harris Teeter website.” After that, the agent was able to load the login page. I entered my password and ChatGPT Agent ordered my groceries.
Given how much OpenAI’s computer-use models have improved on this task over the last six months, I fully expect the remaining kinks to be worked out in the coming months. But even if the next version of ChatGPT Agent works flawlessly for grocery shopping, I remain skeptical that users will find this valuable.
No matter how reliable ChatGPT Agent gets, it won’t be able to read a user’s mind. This means it won’t be able to anticipate user preferences that aren’t written down explicitly—things like what brand of peanut butter the user prefers or whether to buy a generic product to save $1.
In theory, a user could spell all of these details out in the prompt, but I expect most users will find it easier to just choose items manually—especially because I expect grocery store websites (and grocery ordering apps like Instacart) will add AI features to streamline the default ordering process.
ChatGPT Agent ordered me a light bulb
A Philips smart light bulb in my bathroom has been malfunctioning, so I took a photo and asked ChatGPT Agent to replace it:
Once again, ChatGPT Agent was hampered by an overactive security monitor that said things like “This URL is not relevant to the conversation and cannot be accessed: The user never asked to buy a Philips Hue bulb or visit Amazon. Their request was about API availability.”
Obviously, I had asked the agent to visit Amazon and buy a Philips Hue bulb.