TL;DR:
We're safe... for now.
This could be SUPER cool
Join me for a live coding session on Monday, May 12th at 7:00 PM Pacific!
These AI Agents Ain't It
I had the pleasure of taking the last week off from work as a mini "staycation". It was long overdue, and I needed to get things back in order before collapsing from burnout. But for me, time off doesn't mean doing nothing... it means doing LOTS of what I love!
Over the week, I spent a ton of time working on BrandGhost, which is my content scheduling and social media management platform. It's built in ASP.NET Core, leverages Aspire, and is deployed to Azure. I love working on it because it feeds my brain AND results in an improved experience not only for me as a content creator, but for our hundreds of users!
I wanted to make sure I spent more time exploring AI tools this week. That meant giving agents another shot at writing some code for me, since my first two attempts with very simple prompts yielded pretty awful results.
But the results I had this week will shock you!
... or maybe they won't. But I have some new perspectives at least. I've already shared them on Code Commute:
The Setup
I had to work on a feature in BrandGhost that would enable better tracking of auth transitions for social media accounts. That is, if a social media account loses auth, we need to be able to see the transition from having auth to not.
Simple.
The challenge is that BrandGhost supports many social media platforms, and while you'd love to believe that auth is handled in a nice, common, standardized way by now... that's a dream. So across all of our social media plugins, I needed to make changes that were VERY similar but just different enough that a search and replace wouldn't work.
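To give a rough sense of the shape of the change, here's a minimal sketch of what that kind of transition tracking could look like. Every name here (AuthState, AuthTransition, ISocialMediaPlugin) is a hypothetical stand-in for illustration, not BrandGhost's actual code:

using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch only -- illustrative stand-ins, not the real BrandGhost types.
public enum AuthState
{
    Authorized,
    Unauthorized
}

// Captures the "had auth, now doesn't" (or vice versa) moment for an account.
public sealed record AuthTransition(
    Guid SocialAccountId,
    AuthState PreviousState,
    AuthState NewState,
    DateTimeOffset OccurredAt);

public interface ISocialMediaPlugin
{
    // Each plugin checks auth in its own platform-specific way,
    // but reports transitions through a common shape like this.
    Task<AuthTransition?> CheckAuthTransitionAsync(
        Guid socialAccountId,
        CancellationToken cancellationToken);
}

The platform-specific part lives inside each plugin's implementation, which is exactly why the edits were similar but never identical across plugins.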
Seems like the perfect job for an LLM armed with the ability to write code directly into my codebase!
The plan was simple:
I would refactor one of the plugins completely on my own
I would create a prompt that explains the refactoring
I would have Cursor refactor the next plugin
I would vet it, tune the prompt as needed, and then let it repeat.
I was in for a bumpy ride.
The First Hurdle: Close, But Not Really
Cursor seemed to take to the prompt with confidence. I was excited to see that it summarized it back nicely to me, and the strategy it started to roll out seemed to make a lot of sense.
Go look for this file
Check the code in these other spots
Make modifications to A, B, and C
Look for upstream areas to change
Repeat
Fix linter errors
Heck yeah! That's what I want to see! Please, AI agent, replace the need for me to refactor my own code! While it was taking its time going through the refactor (which should touch roughly 5-7 code files), I was exploring some other things I could do with Aspire. It seemed like a great way to get more productivity out of my time!
Cursor announced it was done, and it was time for me to check things out.
Oh man.
It had touched 2 files and concluded that it was done. Not only that, it claimed it had fixed linter errors, but the couple of spots it touched were just... incomplete.
That's okay -- time to refine! So I adjusted my prompt AND used a trick that I found worked well before when talking to ChatGPT for debugging: I took a diff of the commit I made for the refactor and told Cursor to use it as a reference for what to change.
Did it help?
Maxing Out The Potential
Turns out, it helped. But it simply was not enough to be complete. It had better coverage, but:
Simple data transfer objects were missing or had too many parameters
Some spots still weren't updated at all
The code never compiled after Cursor was done, despite my prompting it to check
It had improved, but I still couldn't trust it to get all the way there. But was this more productive than I would have been on my own? Would I be able to get through as much Aspire research if I had to refactor it myself?
Probably. Because the two worst parts were as follows:
Cursor would outright just randomly remove critical method calls. Poof. Gone. Vanished. No comment about why. No explanation or mention of doing it. It would just completely remove them -- and there was no pattern to it! After completing the refactoring, I did a pass over everything more than 5 times and was STILL catching things it had removed for no reason.
The false confidence in testing scenarios was absolutely incredible. Cursor was writing tests that claimed to validate certain conditions, but the test would assert a single value, followed by a comment along the lines of "optionally, actually assert all the important things". It had other examples to refer to! Why did it get lazy?!
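To show the pattern I mean, here's a reconstruction in spirit of that kind of test -- not the actual generated code. It reuses the hypothetical types from the sketch above and assumes xUnit:

using System;
using Xunit;

public sealed class AuthTransitionTests
{
    [Fact]
    public void LosingAuth_RecordsTransition()
    {
        // Arrange: simulate an account that just lost auth
        // (hypothetical types from the earlier sketch).
        var transition = new AuthTransition(
            SocialAccountId: Guid.NewGuid(),
            PreviousState: AuthState.Authorized,
            NewState: AuthState.Unauthorized,
            OccurredAt: DateTimeOffset.UtcNow);

        // Assert: only a single value is actually checked...
        Assert.NotNull(transition);

        // ...followed by a comment standing in for the real assertions:
        // "optionally, actually assert all the important things"
        // (previous state, new state, timestamp, and so on).
    }
}

The test name promises a meaningful validation, but the single assertion proves almost nothing -- which is exactly the false confidence problem.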
Final Results
Unfortunately, agent-based refactoring was not a time saver for me. While it was cool that Cursor could go update my codebase while I did other stuff, the reality is that I spent far more time trying to instruct it and then correct its mistakes.
Not only that, but my confidence in the code change had been significantly reduced, which cost me even more time scrutinizing everything in greater depth. Places that should have been trivial changes had code missing. How could I be sure I'd catch it all?
Given that I have had great success with ChatGPT and Copilot chat writing targeted pieces of code or walking through designs, I don't blame the model necessarily. I think the issue is in the context window.
When I work in chat, I am providing all of the targeted context. In Cursor, I can provide context too, but it seems to get overwhelmed with other context -- perhaps because it is trying to find the right spots in the code to update.
Given the amount of time and effort that went into chatting with Cursor and then the amount of manual effort I had to put in to correct what it did... I would say this was certainly not a time saver or a productivity boost. It was definitely interesting though!
Next Steps
In the end, Cursor AI did a pretty terrible job at refactoring my codebase. While it might seem like I'm knocking Cursor, that's not my intention. My goal is to draw attention to the fact that despite all the noise and fear about agents taking our coding jobs... we're still very, very safe for now.
Will I continue to use agent mode? Yes, I will. I'll keep trying to refine my process and assume that things will get better over time. I don't want to be trying it out for the first time months or years from now, when I expect it will be much better.
Will I continue to use LLMs in chat mode? Absolutely. They've been awesome for design discussions and for doing targeted code changes. I'll continue to leverage this.
I remain very excited to see what we can get out of working with agents. However, as of today, my job feels very safe.
Join me and other software engineers in the private Discord community!
Remember to check out my courses, including this awesome discounted bundle for C# developers:
As always, thanks so much for your support! I hope you enjoyed this issue, and I'll see you next week.
Nick “Dev Leader” Cosentino
social@devleader.ca
Socials:
– Blog
– Dev Leader YouTube
– Follow on LinkedIn
– Dev Leader Instagram
P.S. If you enjoyed this newsletter, consider sharing it with your fellow developers!