
What are the AI Blacksmiths Missing?

Daniel Grant
Published on 1 December 2025

Along with every other software engineer, I've had no end of discussions about whether AI can create reliably good software. These discussions often fracture into two camps. Generally – and this is a generalisation – there are those who consider AI more capable than humans and consequently delegate it responsibility for the implementation (let's call them Alchemists), and those who trust it much less, and consequently use it as a tool (let's call them Blacksmiths).

Currently, I am a Blacksmith. The purpose of this post is not to make a case for my side, but to solicit wisdom from the other. I find these discussions often drift into the rhetorical, and then into attempts to give meaning to the other camp's flawed conclusion ("they're inexperienced", or, "they're in denial"), perhaps missing an opportunity to discover an entirely new perspective.

Side note: Early in my career, I threw in my lot with a set of technologies that were, at the time, demonstrably inferior to the status quo. Experienced engineers did not fail to point this out to me, and I, in turn, perceived them to be condescending. They were correct that I had overestimated both the technology and my own capabilities; but I was eventually correct.

Anyway, to ground this discussion in reality, I will present a couple of real work examples using Claude Opus 4.5 to make the basis of my position transparent. You can then tell me if these examples are a good test, or if I need to approach AI in a different way.

Both of these examples were tasks for act.cool. I chose front-end tasks as Claude has a reputation for excelling at this kind of work, and that was what I was working on last night.

I executed both tasks via OpenCode, using a plan-then-execute approach, billed at Anthropic API pricing.


Task 1: Break each landing page section into individual components.

Context: A simple task to clean up some hacky code written by Claude Sonnet 4.5. The prompt was intentionally left high-level rather than spoon-feeding the model precise principles for encapsulation. My assumption is that it pays dividends to clean up code periodically; otherwise, the codebase becomes a more challenging target for both humans and models.

Estimated human engineering time: less than 1 hour.

Inference cost: $2.60.

Performance: Moderate. It successfully split the "everything component" into smaller components without breaking anything, but it misidentified some of the boundaries. Nonetheless, this was a good start from which I could take over.
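
To make the task concrete, the shape of the change is sketched below. This is a hypothetical, heavily simplified version: the component names (LandingPage, Hero, Features, Pricing) are stand-ins, and I'm assuming a React/TSX codebase, which may not match act.cool exactly.

```tsx
import React from "react";

// Before: one "everything component" rendering the whole landing page.
// After: each section extracted into its own focused component.
// All names here are hypothetical.

function Hero() {
  return (
    <header>
      <h1>act.cool</h1>
      <p>One clear value proposition, one call to action.</p>
    </header>
  );
}

function Features() {
  return (
    <section>
      {/* Feature cards live here, with their own local state if needed. */}
    </section>
  );
}

function Pricing() {
  return (
    <section>
      {/* Pricing tiers; no knowledge of Hero or Features internals. */}
    </section>
  );
}

// The landing page becomes a thin composition of sections. The interesting
// judgement call, and where the model stumbled, is deciding where one
// section's responsibility ends and the next one's begins.
export default function LandingPage() {
  return (
    <main>
      <Hero />
      <Features />
      <Pricing />
    </main>
  );
}
```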


Task 2: Refactor the chat application to use sticky-positioned elements instead of relying on scroll-jacking.

Context: I chose this task because I thought it could be quite gnarly, would interact with multiple non-trivial components, and would be difficult for a model to verify.

Estimated human engineering time: between one hour and one day.

Inference cost: $4.80.

Performance: The model correctly mapped the surface area of the problem and presented a well-reasoned plan for how to perform the task. However, it also proposed one change that was out of scope, and another that involved forking a downstream library. After I had corrected the plan, and found a workaround on GitHub issues that saved the model from forking the library, the model executed the solution with only three small bugs. When I pointed out the bugs, the model was able to fix one of them, but it also introduced a regression by deleting some functionality necessary for preserving the layout when a user has a scroll-wheel mouse.
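
To make the refactor concrete, here is a minimal sketch of the idea, assuming a React/TSX codebase: instead of listening to scroll events and repositioning the composer in JavaScript (scroll-jacking), the composer stays in view with position: sticky inside the scroll container. The markup and names are hypothetical and far simpler than the real chat application.

```tsx
import React from "react";

// Hypothetical, simplified chat layout. Rather than intercepting wheel and
// scroll events and manually repositioning the input, the composer is pinned
// with position: sticky, leaving the browser's native scrolling alone.

function ChatWindow({ messages }: { messages: string[] }) {
  return (
    <div style={{ height: "100vh", overflowY: "auto" }}>
      <ol>
        {messages.map((message, i) => (
          <li key={i}>{message}</li>
        ))}
      </ol>

      {/* Sticks to the bottom of the scroll container without any JS. */}
      <div style={{ position: "sticky", bottom: 0, background: "white" }}>
        <input placeholder="Type a message" />
      </div>
    </div>
  );
}

export default ChatWindow;
```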


I was very happy with the performance of the model on both tasks – in both cases, the solutions were a good start that saved me some time and thinking.

However, I could not use a model in this way to delegate responsibility and still achieve an acceptable level of quality.

So, what do I need to do differently to maximise model autonomy? Should I run the model in a while loop and multiply my inference spend? But if I do that, how does the model verify that it has completed its task? Should I care less about clean code and focus more on the delivered functionality? If I do that, what steps do I have to take to ensure that the performance of the model doesn't degrade as the codebase becomes more cluttered?
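
For what it's worth, here is roughly what I imagine the while-loop version would look like: run the agent, run some verification the model cannot grade itself on (tests, type checks, a visual diff), and feed the failures back in until the checks pass or the attempt budget runs out. Everything in this sketch is hypothetical; runAgent and runChecks are stand-ins, not a real OpenCode or Anthropic API.

```ts
// Hypothetical agent-in-a-loop sketch. runAgent and runChecks are stand-ins
// for whatever harness you actually use (a coding agent CLI, a CI job, a
// screenshot-diff tool).

type CheckResult = { passed: boolean; failures: string[] };

// Stub: send the task (plus any feedback) to the model and let it edit the repo.
async function runAgent(task: string, feedback: string[]): Promise<void> {
  // e.g. shell out to your coding agent here
}

// Stub: independent verification the model cannot grade itself on.
async function runChecks(): Promise<CheckResult> {
  // e.g. run the test suite, type-check, lint, compare screenshots
  return { passed: false, failures: ["placeholder"] };
}

async function runUntilVerified(task: string, maxAttempts = 5): Promise<boolean> {
  let feedback: string[] = [];
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await runAgent(task, feedback);
    const result = await runChecks();
    if (result.passed) return true;
    // Feed concrete failures back in, rather than trusting self-review.
    feedback = result.failures;
  }
  return false; // budget exhausted: a human takes over
}
```

The open question above still stands, though: the quality of this loop is bounded by how good runChecks is, and neither of my two tasks had a verification step a model could run unaided.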

I obviously remain sceptical at this point, but I also don't know what I don't know, so if you have a perspective that would change the way I'm thinking about this, I'd love to read it.

Thanks for reading! If you enjoyed this, subscribe for free to get next month's post delivered direct to your inbox.