
The Loop: Why Some Get Magical Results With AI Agents and Others Just Get Garbage

“The reason agentic work feels ‘magical’ for some people and ‘useless’ for others is rarely the model. It’s the loop.” — nibzard, Agentic Handbook

Picture two developers. Same tech stack, same model, same IDE. One ships a complete feature in three hours. Tests included, edge cases covered, ready for review. The other has been struggling for two days. His agent hallucinates API endpoints that don’t exist, half-implements features, and insists everything works. Nothing compiles.

I’ve been in both scenarios. And the difference had nothing to do with the model.

AI Agents Without a Feedback Loop Are Random Number Generators

The typical experience with AI coding agents goes like this: You hand it a task, the agent produces 200 lines of code, proudly announces “Production Ready!”, and then you start the application. Errors. Errors everywhere. So you feed the error message back, the agent politely apologizes, changes three lines, produces new errors. After an hour, you’ve broken more than you fixed.

That’s not a bug. That’s the expected behavior of a system without a feedback mechanism.

Anthropic documented exactly this pattern in their own research when they had Claude work on their codebase:

“Claude’s failures manifested in two patterns. First, the agent tended to try to do too much at once.” — Anthropic Engineering

The agent tries to do too much at once. The context fills up. Features stay half-implemented. And then comes the second pattern, which is even more insidious:

“One final major failure mode that we observed was Claude’s tendency to mark a feature as complete without proper testing.” — Anthropic Engineering

The agent says “done” without ever testing. Think about that. Anthropic themselves, the company that builds Claude, documents that their own agent marks features as complete without verifying them. This is not an edge case. This is the default behavior.

Why does this happen? Because there is no mechanism that forces the agent to check its own work. No compiler saying “this doesn’t work,” no test turning red, no gate blocking the way. The agent generates text. That’s all it does. It has no idea whether that text represents working code unless someone feeds the execution result back.

Imagine a junior developer. First day on the job. No code review, no CI/CD, no tests, no pair programming. He writes code, commits straight to main, goes home. Of course the result will be bad. Not because the developer is stupid, but because the feedback mechanisms are missing.

That’s exactly how most people treat their AI agents.

The Loop: The Principle That Changes Everything

Last year, I started systematically tracking how I work with AI coding agents. Which sessions were productive, which ones ended in chaos. The pattern was unmistakable. Every successful session had one thing in common: a closed feedback cycle. Every failed session had one thing in common: none.

Peter Steinberger, founder of PSPDFKit and currently working on OpenClaw, nailed it in the Pragmatic Engineer Podcast:

“The good thing — how to be effective with coding agents — is always you have to close the loop. It needs to be able to debug and test itself. That’s the big secret.” — Peter Steinberger, Pragmatic Engineer Podcast

The big secret. No better prompt needed, no prompt engineering course. The agent needs the ability to check and correct its own work.

Steinberger goes further and explains why code is the ideal use case for AI agents:

“That’s the whole reason why those models that we currently have are so good at coding but sometimes mediocre at writing — because there’s no easy way to validate. But code I can compile, I can lint, I can execute, I can verify the output.” — Peter Steinberger, Pragmatic Engineer Podcast

Code has something most other domains don’t: objective validation. It compiles or it doesn’t. The test is green or red. The linter reports errors or none. No “well, it depends.” This binary nature makes code the perfect terrain for AI agents. But only if you actually build that validation in.

Without the loop, the workflow looks like this: Prompt in, output out, hope it works. That’s a slot machine. Sometimes jackpot, usually not.

With the loop, it’s different: Prompt in, code generated, automatically compiled, tests run, errors caught, agent corrects, tests run again. Green. Not hoping, but knowing.

That’s the difference between chance and system.

How to Close the Loop in Practice

The theory sounds obvious. Of course you need feedback. But what does it actually look like when I want to build my next feature tomorrow morning?

Over the past months, I’ve identified three approaches that together close the loop. None of them works alone. All three together change everything.

Requirements as Executable Specifications

The first and most important shift: A requirement is not a vague description. “Build me a login form” is not a requirement. That’s a wish. And wishes are what most people feed their agents.

A requirement that closes the loop looks different. It has concrete success criteria. “The login form accepts email and password. On invalid email, an error message appears below the field. On wrong password, a generic error appears, no hint whether the email exists. After three failed attempts, the account is locked for 15 minutes. A successful login sets a JWT token with 24h expiry.”

The crucial point: These success criteria BECOME the tests. Every criterion can be translated into an automated test. And these tests close the loop. The agent writes code, the tests run, they’re red, the agent corrects, the tests run again, they’re green. Loop closed.
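To make that concrete, here is a minimal sketch of how one criterion (“after three failed attempts, the account is locked”) becomes an executable test. `LoginService` is a hypothetical stand-in for the real implementation, and the 15-minute expiry is elided:

```python
class LoginService:
    """Toy stand-in for the system under test (hypothetical).

    A real implementation would store a lock timestamp and expire
    the lock after 15 minutes; this sketch only tracks the count.
    """
    MAX_ATTEMPTS = 3

    def __init__(self):
        self.failed_attempts = {}
        self.locked = set()

    def login(self, email, password):
        if email in self.locked:
            return "locked"
        if password != "correct":  # toy credential check
            count = self.failed_attempts.get(email, 0) + 1
            self.failed_attempts[email] = count
            if count >= self.MAX_ATTEMPTS:
                self.locked.add(email)
            return "invalid"
        return "ok"


def test_account_locks_after_three_failed_attempts():
    svc = LoginService()
    for _ in range(3):
        assert svc.login("a@b.com", "wrong") == "invalid"
    # Even the correct password is rejected while the account is locked.
    assert svc.login("a@b.com", "correct") == "locked"
```

The point is the mapping: one success criterion, one test. When the test is red, the agent knows it is not done, with no room for interpretation.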

This makes every agent session disposable. That sounds counterintuitive, but it’s liberating. If the agent goes completely off the rails, I delete the session and start fresh. The requirement and the tests remain. I lose nothing but a few minutes of compute time.

In a project last week, I built a complex database migration with this approach. 14 success criteria, 14 tests, defined upfront. The agent needed two attempts. The first session hit a dead end after 40 minutes. I deleted it, restarted, and the second session got all 14 tests to green in 25 minutes. Without the predefined tests, I wouldn’t have noticed the difference between “almost done” and “completely broken” until manual testing. Probably hours later.

Steinberger’s Practical Playbook

Steinberger’s approach is more radical than mine. He builds his entire architecture so that agents can validate it.

He has agents write tests with every single feature. No feature without a test, no exceptions. He built CLI tools so the agent can debug itself. The agent needs tools for self-correction, and these tools don’t exist out of the box.

His “gate” concept is simple and brutally effective: Linting, building, type-checking, all tests must pass before code is considered done. No manual “looks good.” Automatic validation.

And then this detail, which impressed me the most:

“Even now for websites, I built the core in a way that can be run via a CLI. So I have this perfect execution loop.” — Peter Steinberger, Pragmatic Engineer Podcast

He builds websites so the core can be run via a CLI. Not because it’s technically necessary, but because it gives the agent a closed feedback loop. He designs his architecture for agent verifiability. That’s a fundamental shift in how we think about software systems.

Architecture determines agent effectiveness. Not the prompt. Not the model. The question is: Can the agent verify its own work? If the answer is no, the result will be random, no matter how good the model is.

Plan-Then-Execute Instead of Flying Blind

The third building block comes from Boris Cherny, the creator of Claude Code:

“Start every complex task in plan mode. Pour your energy into the plan so Claude can 1-shot the implementation.” — Boris Cherny, X post

The energy belongs in the plan, not the implementation. For a complex task, I start in plan mode. I describe what needs to be built, which files are affected, which dependencies exist, what the success criteria look like. The agent creates a detailed plan. Only when the plan is right does implementation begin.

Steinberger offers an interesting counterpoint here. He doesn’t use a formal plan mode, but has a conversation with the agent. He works through the problem together with it, iterates on the solution, corrects course. Less formal, more dialogue.

What both approaches share: Deliberate steering instead of blind prompting. The human sets the direction, defines the boundaries, establishes the success criteria. The agent executes and validates. This is not delegation to an autonomous system. It’s guided collaboration with a very fast, very diligent, but sometimes hallucinating assistant.

The critical mistake I see most people make: They give the agent a complex task and just let it run. No plan, no boundaries, no checkpoints. That’s like telling an intern “build that feature” and not checking on them for three days.

The Three Feedback Loops Everyone Should Set Up

Enough theory. Here are the three concrete loops that make the difference. I’ve listed them in order of importance, and yes, the first one sounds ridiculously trivial.

The Compile/Build Loop is the foundation. The agent must be able to compile and run its code. This sounds so obvious it’s almost embarrassing to mention. But reality looks different. According to an analysis by Anthropic, missing build verification steps were one of the most common reasons for agent failures in longer sessions. In many setups, the agent generates code in an isolated environment without access to the build system. It writes TypeScript that never goes through the compiler. It imports packages that aren’t installed. Without the build loop, everything is speculation.

The Test Loop is the actual “close the loop.” Every feature needs tests, and the agent writes them too. The tests directly validate the requirement. When I say “the API must return a 401 on an invalid token,” that becomes an automated test. If the test is red, the agent knows it’s not done. No room for interpretation, no discussion.

The Lint/Quality Loop catches everything the tests don’t cover. Static analysis, formatting, type-checking. Automatic gates that prevent bad code from slipping through. Unused imports, missing error handling blocks, security patterns. This is the third line of defense.

Steinberger’s gate concept combines all three: All loops must pass before something is considered “done.” No exceptions. No “we’ll fix that later.” This sounds strict, but it’s enormously relieving. I no longer have to review every line. The loops catch the big stuff. My review focuses on architecture decisions and business logic, not missing semicolons and broken imports.

In practice it looks like this: I define my success criteria, the agent generates tests from them, writes the implementation, the build runs, the tests run, the linter runs. All green? Feature done. Something red? The agent sees the error and corrects it. Automatically. Without me having to intervene.

That’s the loop. Generate, check, fix, repeat. That simple. That effective.
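That cycle can be written down almost literally. In this sketch, `generate`, `check`, and `fix` are placeholders for an agent call, the gates (build, tests, lint), and an error-driven retry; none of them is a real API:

```python
def close_the_loop(generate, check, fix, max_rounds=5):
    """Run the feedback cycle until the checks pass or we give up.

    generate() produces a first draft, check(code) returns a list of
    errors (empty means all gates are green), and fix(code, errors)
    feeds the errors back to produce the next attempt.
    """
    code = generate()
    for _ in range(max_rounds):
        errors = check(code)        # e.g. build + tests + lint
        if not errors:
            return code             # all gates green: actually done
        code = fix(code, errors)    # feed the errors back, don't hope
    raise RuntimeError("loop did not converge; discard the session, keep the tests")
```

The bound on `max_rounds` encodes the disposability idea from earlier: if the loop doesn’t converge, the session is thrown away while the requirement and its tests survive.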

Why This Is a Mindset Shift, Not a Tool Problem

I constantly see the same discussion: “Which tool is better? Claude Code or Codex? Cursor or Windsurf?” The question is wrong. It’s like asking which screwdriver is better when the real problem is that you’re using the wrong screw.

The shift is: From “prompter” to requirements engineer. From the person who tells the computer what to do, to the person who defines how you know it’s right.

Traditionally, I wrote code and hoped it worked, then tested manually. With AI agents, I first write the definition of “works,” then let the agent generate code and tests. My work shifts from the WHAT to the HOW-YOU-KNOW-IT-WORKS.

The irony is almost comical. The “old” software engineering disciplines (testing, CI/CD, clear specifications, clean architecture) are becoming more important than ever in the AI age. All the things we’ve been preaching for years and often ignored. Suddenly they’re no longer nice-to-have. They are the difference between an agent that delivers magical results and one that produces garbage.

TDD, Test-Driven Development, was an ideal most people didn’t practice for years. Now it’s the natural workflow with AI agents. Tests first, then implementation. Not for ideological reasons, but because it’s the only way to close the loop.

Requirements engineering was the unloved stepchild of software development for years. “Just write code” was the motto. Now the quality of requirements directly determines the quality of AI output. Bad requirements, bad output. Precise requirements with measurable success criteria, precise output.

It’s not a coincidence. It’s a feedback system. And feedback systems need clear signals.


Take your next feature. Before you start the agent, write down the success criteria. Not vague, not “it should work.” Concrete. Measurable. Testable. Have the agent write the tests first. Then the implementation. Watch what happens when the agent can verify its own output.

The difference will be obvious. Not because you wrote a better prompt. Not because you used a better model.

But because you closed the loop.

You don’t need a better model. You need a better loop.