What I learned trying seven coding agents
There's still room for improvement, but don't underestimate this technology.
In recent weeks, I’ve written multiple articles about the new generation of coding agents that have come out over the last year. It seems clear that agents like these are going to change how programmers do their jobs. And some people worry that they’re going to put programmers out of work altogether.
It’s impossible to truly understand a tool if you haven’t used it yourself. So last week I decided to put seven popular coding agents to the test.
I found that right now, these tools have significant limitations. But unlike the computer-use agents I wrote about yesterday, I do not view these coding agents as a dead end. They are already powerful enough to make programmers significantly more productive. And they will only get better in the coming months and years.
I also talked to an experienced programmer about how he uses tools like Claude Code. The conversation gave me better insight into how these tools are likely to change the programming profession—and perhaps other professions in the future.
Hands-on with seven different coding agents
LLMs tend to do well on “canned” problems, but they struggle with more complex real-world tasks. So I wanted to use some real-world data for my own testing. I chose a subject that will be familiar to long-time readers: Waymo crashes, which are far less common, per mile, than crashes by human-driven vehicles.
Specifically, I asked coding agents to merge two spreadsheets—one from Waymo and the other from the National Highway Traffic Safety Administration—and then build a website to search, sort, and browse data on crashes involving Waymo vehicles.
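To give a flavor of the data-wrangling step involved, here is a minimal sketch of how the merge might look if done by hand in Python with pandas. The file names, the shared report-ID column, and the columns kept for display are hypothetical placeholders, not the actual structure of either spreadsheet, and the agents each chose their own approach.

```python
# Minimal sketch of the spreadsheet merge. File names and column names are
# illustrative assumptions, not the real Waymo or NHTSA data layout.
import pandas as pd

# Load the two source spreadsheets (file names are placeholders).
waymo = pd.read_csv("waymo_crash_reports.csv")
nhtsa = pd.read_csv("nhtsa_sgo_incidents.csv")

# Join the two datasets on a shared report identifier (an assumed column name).
merged = pd.merge(waymo, nhtsa, on="report_id", how="outer")

# Keep the handful of columns a crash-browsing site might actually display.
display_columns = ["report_id", "date", "city", "state", "injury_severity"]
site_data = merged.reindex(columns=display_columns)

# Export to JSON so a simple website can load, search, and sort it client-side.
site_data.to_json("crashes.json", orient="records", date_format="iso")
```

Exporting to JSON for a client-side front end is just one plausible design; the agents made their own choices about how to store and serve the merged data.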
I tried seven agentic coding platforms, playing the role of a coding novice. At no point did I look at any code to try to figure out what might be causing a bug—I simply gave the agents high-level instructions and relied on them to work out the implementation.
The results were decidedly mixed. Let’s start with the least successful coding agents and move to the more successful ones.
Bolt.new immediately gave up, stating “looks like you've hit the size limit.” A help page states that “if your Bolt project exceeds the context window (200k tokens for free accounts, 500k for paid accounts), Bolt notifies you in the chat, saying Project size exceeded.” Evidently, my spreadsheets were too large to fit into the model’s context window.
Other agentic coding platforms give models tools for searching through files and reading portions of them as needed. This lets them work with code and data far larger than the model’s context window. Bolt doesn’t seem to have this capability, so it couldn’t process a large spreadsheet.
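For illustration, here is a minimal sketch of the kind of file-access tools such a platform might expose to the model. The function names, return format, and chunk size are my own assumptions, not a description of how any particular product works.

```python
# Sketch of file-access tools an agent platform might expose, so the model can
# read a large file in slices instead of loading it all into its context.
# Tool names and the chunk size are illustrative assumptions.
from pathlib import Path

CHUNK_CHARS = 8_000  # rough per-call budget, far smaller than a context window


def list_files(root: str) -> list[str]:
    """Let the model discover which files exist before deciding what to read."""
    return [str(p) for p in Path(root).rglob("*") if p.is_file()]


def read_chunk(path: str, offset: int = 0) -> dict:
    """Return one slice of a file; the model calls this repeatedly as needed."""
    text = Path(path).read_text(errors="replace")
    chunk = text[offset : offset + CHUNK_CHARS]
    return {
        "path": path,
        "offset": offset,
        "content": chunk,
        "has_more": offset + CHUNK_CHARS < len(text),
    }
```

The point is that the model only ever sees one slice at a time, so the file itself never has to fit inside the context window.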
Replit couldn’t make a website. Replit made a plan and chugged along for 17 minutes trying to implement it. It claimed the website was ready, but I couldn’t get it to load. Replit couldn’t fix the problem no matter how many times I asked.
Lovable generated an attractive website but struggled with importing data. With repeated prompting, I coaxed Lovable to import most of the data and convert it to the right format. But as I write this, the values in the date column are all “N/A” despite me repeatedly asking Lovable to fix this.
Windsurf showed early promise. It created a basic but working website more quickly than any of the other coding agents. And I was able to make some changes. But it was brittle. For example, I asked for a pop-up menu to filter by injury severity (minor, moderate, severe, or fatal). Windsurf created a menu but couldn’t get the filtering to work properly—even after several attempts.
OpenAI’s Codex got the job done, but the results lacked polish. One of my source spreadsheets had dozens of columns. The other coding agents all chose a handful of important columns—like location, date, and injury severity—to display. Codex, in contrast, included every column, making the site several times wider than the browser window. Even after I spent several minutes working with Codex to improve the site’s appearance, it still had more quirks and minor formatting issues than the sites created by other coding agents.
Cursor did a solid job. I wasn’t initially optimistic about Cursor, which spent a long time setting up a development environment on my local computer with multiple false starts. The first website it created didn’t work at all; it showed an endless “loading” indicator. But with some coaxing, Cursor fixed that problem and created a nice-looking website. I could request modifications (like adding or removing columns, adding search criteria, or changing formatting) and most of the time it would succeed on the first try. This is the vibe-coding dream, and only two of the seven coding agents achieved it.
Claude Code did the best job. It was able to complete the task almost straight through with minimal backtracking. Once I had a basic site set up, I could request new features or design changes and Claude was often able to get them right on the first try.
Lessons from my vibe-coding experiment
I want to start by repeating what I said in yesterday’s post: it’s amazing that these products work as well as they do. Like the computer-use agents, these coding agents are the result of combining reinforcement learning with tool use. They all have significant limitations, but all of them are also far more powerful than any coding tools that existed even a year ago.
There seems to be a tradeoff between user-friendliness and power. Bolt, Lovable, and Replit market themselves as “vibe coding” platforms that allow non-programmers to create entire websites with a single prompt. When this works, the results can be stunning. But all three platforms seem to struggle if you ask them to do something too ambitious or off the beaten path.
At the opposite extreme, Claude Code and Codex are designed for use by professional programmers. To use them, you need to get an API key and know your way around a Unix command line. I can easily imagine novices finding them overwhelming. But on the flip side, they seem to be significantly more versatile.
I was surprised by how much using these agents involved repeated trial and error. There were many times when I reported a bug, the agent tried to fix it, and I said “it still doesn’t work, try it again.” Sometimes we would do this five or ten times in a row, with the agent seeming to try a different approach each time. And sometimes it would succeed after a series of failures.
But there were other bugs that an agent just couldn’t fix no matter how many times it tried. I observed this with Lovable, Replit, and Windsurf during my testing.
This is not a big deal if you’re an experienced programmer who can peek under the hood, figure out what’s wrong, and give the agent more detailed instructions. But if you’re a non-programmer trying to write code entirely by prompting, it can be a huge headache.
How the pros use coding agents
My website for browsing crash data is quite modest as programming projects go. Lots of companies build and maintain much more complicated and consequential software. Pure vibe coding won’t work for larger software projects because even the most sophisticated agents are eventually going to get stuck.
Coding agents can still be useful for larger projects, but they require a different approach. To learn more about this, I spoke to Understanding AI reader Aaron Votre, a software developer at the disaster recovery startup Bright Harbor. Votre has been a heavy user of Cursor and Claude Code, and he says they have dramatically increased his productivity. He agreed to let me look over his shoulder as he worked on a real software project.
We saw yesterday that computer-use agents tend to fail if they lack sufficiently precise or detailed instructions. The same is true for coding agents: the more context the programmer provides, and the more specific the instructions are, the more likely the coding agent is to produce the outcome the programmer is looking for.
“We have a long file that has hundreds of lines of all of our guidelines,” Votre told me. The file, which is always included in Claude Code’s context window, includes information about which software tools the company uses for various functions. It also includes the kind of advice the company would give a new engineer.
The file includes instructions like “fix each issue before proceeding to ensure code quality,” “sort aliases alphabetically,” and “test individual formatting functions separately for better isolation.”
“Every time it does something wrong, we add to this,” Votre said. “Which is why Claude is the best for us right now because we have the most set up for it.”
During my vibe coding experiments, I didn’t provide coding agents with this kind of guidance. And this undoubtedly made the agents’ jobs harder.
For example, several times Cursor tried to run software that wasn’t installed on my laptop. It was then able to either use a different tool or install the one it needed, so this wasn’t a big problem. But it would have been better if I’d told it which tools to use at the outset.
And this kind of thing can be a much bigger deal in a larger project. A lot of considerations go into choosing which software components to use for a large project, including cost, security, and compatibility with other software. Even the best of today’s agents need guidance to make sure they make choices that fit with the company’s broader goals. And the best way to do that is to have a file that explicitly lists which tools to use in which situations.
Votre said another key strategy for using software agents on large software projects is to have the agent write out a detailed plan. The human programmer can review the plan and make sure it all makes sense—if not, he can modify it. Reviewing a plan like this is an opportunity to detect ways that the initial instructions might have been imprecise or confusing. And it also gives the programmer an opportunity to provide additional context.
The future of programming is English
What both of these strategies—providing a big file full of context and reviewing the agent’s plan before it’s executed—have in common is that they require the programmer to have a detailed understanding of the company’s code. It’s different from the “vibe coding” vision where someone with no programming background gives a high-level objective and lets the agent figure out the details.
People like to talk about coding agents replacing engineers, but I think it makes more sense to think about this the way Andrej Karpathy put it a couple of years ago: “The hottest new programming language is English.”
At the dawn of the computer industry, programmers had to write code directly using low-level mathematical operations like add and multiply. In the 1950s, people started inventing programming languages like Cobol and Fortran that allowed programmers to write higher-level instructions and have a computer translate those instructions into lower-level binary code.
That process has continued over the decades—modern programming languages like Python come with extensive function libraries that allow powerful programs to be written in just a few lines of code. Thanks to better libraries and frameworks, programmers are far more productive today than 10 or 20 years ago.
Coding agents are the next rung on this ladder of generality. In the past, programmers would give a computer instructions in a language like C++ or Python, and a compiler or interpreter would translate them into binary machine code. Today, programmers can give a computer instructions in English, and a coding agent will translate them into a programming language like C++ or Python.
This new paradigm means that programmers have to spend a lot less time sweating implementation details and tracking down minor bugs. But what hasn’t changed is that someone needs to figure out what they want the computer to do—and give the instructions precisely enough that the computer can follow them. For large software projects, this is going to require systematic thinking, awareness of tradeoffs, attention to detail, and a deep understanding of how computers work. In other words, we are going to continue to need programmers, even if most of them are writing their code in English instead of C++ or Python.
And I think something similar will be true in other professions impacted by AI agents in the coming years. There will undoubtedly be legal agents that help lawyers write legal briefs, revise contracts, and conduct document reviews. But someone still needs to figure out what jobs to give these agents—and then evaluate whether they’ve done a good job. To do that job well will require legal expertise. In other words, at least for the next few years, it will make more sense to think of legal agents like this as a new tool for lawyers to use rather than as replacements for the lawyers themselves.