The Loop: Why Some Get Magical Results With AI Agents and Others Just Get Garbage
"The reason agentic work feels 'magical' for some people and 'useless' for others is rarely the model. It's the loop." — nibzard, Agentic Handbook
Picture two developers. Same tech stack, same model, same IDE. One ships a complete feature in three hours. Tests included, edge cases covered, ready for review. The other has been struggling for two days. His agent hallucinates API endpoints that don't exist, half-implements features, and insists everything works. Nothing compiles.
I've been in both scenarios. And the difference had nothing to do with the model.
AI Agents Without a Feedback Loop Are Random Number Generators
The typical experience with AI coding agents goes like this: You hand it a task, the agent produces 200 lines of code, proudly announces "Production Ready!", and then you start the application. Errors. Errors everywhere. So you feed the error message back, the agent politely apologizes, changes three lines, produces new errors. After an hour, you've broken more than you fixed.
That's not a bug. That's the expected behavior of a system without a feedback mechanism.
Anthropic documented exactly this pattern in their own research when they had Claude work on their codebase:
"Claude's failures manifested in two patterns. First, the agent tended to try to do too much at once." — Anthropic Engineering
The agent tries to do too much at once. The context fills up. Features stay half-implemented. And then comes the second pattern, which is even more insidious:
"One final major failure mode that we observed was Claude's tendency to mark a feature as complete without proper testing." — Anthropic Engineering
The agent says "done" without ever testing. Think about that. Anthropic themselves, the company that builds Claude, documents that their own agent marks features as complete without verifying them. This is not an edge case. This is the default behavior.
Why does this happen? Because there is no mechanism that forces the agent to check its own work. No compiler saying "this doesn't work," no test turning red, no gate blocking the way. The agent generates text. That's all it does. Whether that text represents working code, it has no idea, as long as nobody feeds back the execution result.
Imagine a junior developer. First day on the job. No code review, no CI/CD, no tests, no pair programming. He writes code, commits straight to main, goes home. Of course the result will be bad. Not because the developer is stupid, but because the feedback mechanisms are missing.
That's exactly how most people treat their AI agents.
The Loop: The Principle That Changes Everything
Last year, I started systematically tracking how I work with AI coding agents. Which sessions were productive, which ones ended in chaos. The pattern was unmistakable. Every successful session had one thing in common: a closed feedback cycle. Every failed session had one thing in common: none.
Peter Steinberger, founder of PSPDFKit and currently working on OpenClaw, nailed it in the Pragmatic Engineer Podcast:
"The good thing — how to be effective with coding agents — is always you have to close the loop. It needs to be able to debug and test itself. That's the big secret." — Peter Steinberger, Pragmatic Engineer Podcast
The big secret. No better prompt needed, no prompt engineering course. The agent needs the ability to check and correct its own work.
Steinberger goes further and explains why code is the ideal use case for AI agents:
"That's the whole reason why those models that we currently have are so good at coding but sometimes mediocre at writing — because there's no easy way to validate. But code I can compile, I can lint, I can execute, I can verify the output." — Peter Steinberger, Pragmatic Engineer Podcast
Code has something most other domains don't: objective validation. It compiles or it doesn't. The test is green or red. The linter reports errors or none. No "well, it depends." This binary nature makes code the perfect terrain for AI agents. But only if you actually build that validation in.
Without the loop, the workflow looks like this: Prompt in, output out, hope it works. That's a slot machine. Sometimes jackpot, usually not.
With the loop, it's different: Prompt in, code generated, automatically compiled, tests run, errors caught, agent corrects, tests run again. Green. Not hoping, but knowing.
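That correction cycle fits in a few lines of code. A minimal sketch, assuming a hypothetical `agent` object with a `generate(task, feedback)` method that edits the working tree; the check command is a placeholder for whatever your project actually runs:

```python
import subprocess

def run_checks(check_cmd: str) -> subprocess.CompletedProcess:
    # Run the project's verification command; any non-zero exit means "not done".
    return subprocess.run(["sh", "-c", check_cmd], capture_output=True, text=True)

def closed_loop(agent, task: str,
                check_cmd: str = "python -m pytest -q",
                max_rounds: int = 5) -> bool:
    """Generate, verify, feed the errors back, repeat.

    `agent` is hypothetical: any object whose generate(task, feedback)
    call writes or edits code in the working tree.
    """
    feedback = ""
    for _ in range(max_rounds):
        agent.generate(task, feedback)            # agent writes/edits code
        result = run_checks(check_cmd)
        if result.returncode == 0:
            return True                           # green: knowing, not hoping
        feedback = result.stdout + result.stderr  # red: errors go back in
    return False
```

The essential property is that the agent never gets to declare success; only the exit code of the checks does.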
That's the difference between chance and system.
How to Close the Loop in Practice
The theory sounds obvious. Of course you need feedback. But what does it actually look like when I want to build my next feature tomorrow morning?
Over the past months, I've identified three approaches that together close the loop. None of them works alone. All three together change everything.
Requirements as Executable Specifications
The first and most important shift: A requirement is not a vague description. "Build me a login form" is not a requirement. That's a wish. And wishes are what most people feed their agents.
A requirement that closes the loop looks different. It has concrete success criteria. "The login form accepts email and password. On an invalid email, an error message appears below the field. On a wrong password, a generic error appears, with no hint whether the email exists. After three failed attempts, the account is locked for 15 minutes. A successful login sets a JWT token with 24h expiry."
The crucial point: These success criteria BECOME the tests. Every criterion can be translated into an automated test. And these tests close the loop. The agent writes code, the tests run, they're red, the agent corrects, the tests run again, they're green. Loop closed.
This makes every agent session disposable. That sounds counterintuitive, but it's liberating. If the agent goes completely off the rails, I delete the session and start fresh. The requirement and the tests remain. I lose nothing but a few minutes of compute time.
In a project last week, I built a complex database migration with this approach. 14 success criteria, 14 tests, defined upfront. The agent needed two attempts. The first session hit a dead end after 40 minutes. I deleted it, restarted, and the second session got all 14 tests to green in 25 minutes. Without the predefined tests, I wouldn't have noticed the difference between "almost done" and "completely broken" until manual testing. Probably hours later.
Steinberger's Practical Playbook
Steinberger's approach is more radical than mine. He builds his entire architecture so that agents can validate it.
He has agents write tests with every single feature. No feature without a test, no exceptions. He built CLI tools so the agent can debug itself. The agent needs tools for self-correction, and these tools don't exist out of the box.
His "gate" concept is simple and brutally effective: Linting, building, type-checking, all tests must pass before code is considered done. No manual "looks good." Automatic validation.
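A gate like that can be a single script that runs every check and stops at the first red one. A sketch assuming typical Python tooling; the commands in `CHECKS` are placeholders to swap for your project's own:

```python
import subprocess

# Hypothetical gate: every check must exit 0 before code counts as "done".
CHECKS = [
    ("lint",      "ruff check ."),
    ("typecheck", "mypy src"),
    ("build",     "python -m compileall -q src"),
    ("tests",     "python -m pytest -q"),
]

def gate(checks=CHECKS) -> bool:
    """Run all checks in order; fail fast and surface the output."""
    for name, cmd in checks:
        result = subprocess.run(["sh", "-c", cmd], capture_output=True, text=True)
        if result.returncode != 0:
            # Print the failing check and its output so the agent can react to it.
            print(f"GATE FAILED at {name}:\n{result.stdout}{result.stderr}")
            return False
    print("GATE PASSED: lint, types, build, tests all green")
    return True
```

Wired into CI, or run by the agent as its "am I done?" command, this replaces the manual "looks good" with a binary answer.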
And then this detail, which impressed me the most:
"Even now for websites, I built the core in a way that can be run via a CLI. So I have this perfect execution loop." — Peter Steinberger, Pragmatic Engineer Podcast
He builds websites so the core can be run via a CLI. Not because it's technically necessary, but because it gives the agent a closed feedback loop. He designs his architecture for agent verifiability. That's a fundamental shift in how we think about software systems.
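One way to read "designed for agent verifiability": keep the core as plain functions with no framework attached, and bolt a thin CLI on top so an agent can exercise it from the terminal. A hypothetical sketch with invented names (`shopcore`, `cart_total`; not taken from Steinberger's actual code):

```python
import argparse
import json

# Core: pure functions, no web framework, callable from a CLI or from tests.
def render_price(cents: int, currency: str = "EUR") -> str:
    """Format a price the same way the site's template layer would."""
    return f"{cents / 100:.2f} {currency}"

def cart_total(items: list) -> int:
    """Sum of quantity * unit price, in cents."""
    return sum(i["qty"] * i["unit_cents"] for i in items)

def main(argv=None) -> int:
    # Thin CLI wrapper around the core, for agents and humans alike.
    parser = argparse.ArgumentParser(prog="shopcore")
    sub = parser.add_subparsers(dest="cmd", required=True)
    sub.add_parser("cart-total").add_argument("items_json")
    args = parser.parse_args(argv)
    if args.cmd == "cart-total":
        print(render_price(cart_total(json.loads(args.items_json))))
    return 0
```

An agent can now run `shopcore cart-total '[{"qty": 2, "unit_cents": 150}]'`, read stdout, and compare it against the success criterion, with no browser in the loop.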
Architecture determines agent effectiveness. Not the prompt. Not the model. The question is: Can the agent verify its own work? If the answer is no, the result will be random, no matter how good the model is.
Plan-Then-Execute Instead of Flying Blind
The third building block comes from Boris Cherny, the creator of Claude Code:
"Start every complex task in plan mode. Pour your energy into the plan so Claude can 1-shot the implementation." — Boris Cherny, X post
The energy belongs in the plan, not the implementation. For a complex task, I start in plan mode. I describe what needs to be built, which files are affected, which dependencies exist, what the success criteria look like. The agent creates a detailed plan. Only when the plan is right does implementation begin.
Steinberger has an interesting counterposition here. He doesn't use a formal plan mode, but has a conversation with the agent. He works toward the problem together with it, iterates on the solution, corrects course. Less formal, more dialogue.
What both approaches share: Deliberate steering instead of blind prompting. The human sets the direction, defines the boundaries, establishes the success criteria. The agent executes and validates. This is not delegation to an autonomous system. It's guided collaboration with a very fast, very diligent, but sometimes hallucinating assistant.
The critical mistake I see most people make: They give the agent a complex task and just let it run. No plan, no boundaries, no checkpoints. That's like telling an intern "build that feature" and not checking on them for three days.
The Three Feedback Loops Everyone Should Set Up
Enough theory. Here are the three concrete loops that make the difference. I've listed them in order of importance, and yes, the first one sounds ridiculously trivial.
The Compile/Build Loop is the foundation. The agent must be able to compile and run its code. This sounds so obvious it's almost embarrassing to mention. But reality looks different. According to an analysis by Anthropic, missing build verification steps were one of the most common reasons for agent failures in longer sessions. In many setups, the agent generates code in an isolated environment without access to the build system. It writes TypeScript that never goes through the compiler. It imports packages that aren't installed. Without the build loop, everything is speculation.
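Even the not-installed-imports failure is cheap to catch mechanically. A sketch of a pre-flight check an agent (or its harness) could run before declaring victory; the helper name is invented:

```python
import importlib.util

def missing_imports(module_names):
    """Return the top-level modules from `module_names` that are not
    importable in this environment -- the check a sandboxed agent
    silently skips when it has no access to the real runtime."""
    return [m for m in module_names if importlib.util.find_spec(m) is None]
```

Running it against the import list of generated code turns "it probably runs" into a yes/no answer before the build even starts.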
The Test Loop is the actual "close the loop." Every feature needs tests, and the agent writes them too. The tests directly validate the requirement. When I say "the API must return a 401 on an invalid token," that becomes an automated test. If the test is red, the agent knows it's not done. No room for interpretation, no discussion.
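That 401 criterion as a test, with a hypothetical stand-in handler (a real project would exercise its actual route instead):

```python
from typing import Optional

VALID_TOKENS = {"tok-alice"}  # hypothetical fixture, not a real auth store

def handle_request(path: str, token: Optional[str]) -> int:
    """Stand-in route handler that returns an HTTP status code."""
    if token not in VALID_TOKENS:
        return 401  # the criterion: invalid or missing token -> 401
    return 200

def test_invalid_token_returns_401():
    assert handle_request("/api/me", "tok-bogus") == 401
    assert handle_request("/api/me", None) == 401

def test_valid_token_returns_200():
    assert handle_request("/api/me", "tok-alice") == 200
```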
The Lint/Quality Loop catches everything the tests don't cover. Static analysis, formatting, type-checking. Automatic gates that prevent bad code from slipping through. Unused imports, missing error-handling blocks, insecure patterns. This is the third line of defense.
Steinberger's gate concept combines all three: All loops must pass before something is considered "done." No exceptions. No "we'll fix that later." This sounds strict, but it's enormously relieving. I no longer have to review every line. The loops catch the big stuff. My review focuses on architecture decisions and business logic, not missing semicolons and broken imports.
In practice it looks like this: I define my success criteria, the agent generates tests from them, writes the implementation, the build runs, the tests run, the linter runs. All green? Feature done. Something red? The agent sees the error and corrects it. Automatically. Without me having to intervene.
That's the loop. Generate, check, fix, repeat. That simple. That effective.
Why This Is a Mindset Shift, Not a Tool Problem
I constantly see the same discussion: "Which tool is better? Claude Code or Codex? Cursor or Windsurf?" The question is wrong. It's like asking which screwdriver is better when the real problem is that you're using the wrong screw.
The shift is: From "prompter" to requirements engineer. From the person who tells the computer what to do, to the person who defines how you know it's right.
Traditionally, I wrote code and hoped it worked, then tested manually. With AI agents, I first write the definition of "works," then let the agent generate code and tests. My work shifts from the WHAT to the HOW-YOU-KNOW-IT-WORKS.
The irony is almost comical. The "old" software engineering disciplines (testing, CI/CD, clear specifications, clean architecture) are becoming more important than ever in the AI age. All the things we've been preaching for years and often ignored. Suddenly they're no longer nice-to-have. They are the difference between an agent that delivers magical results and one that produces garbage.
TDD, Test-Driven Development, was for years an ideal most people didn't practice. Now it's the natural workflow with AI agents. Tests first, then implementation. Not for ideological reasons, but because it's the only way to close the loop.
Requirements engineering was for years the unloved stepchild of software development. "Just write code" was the motto. Now the quality of the requirements directly determines the quality of the AI output. Bad requirements, bad output. Precise requirements with measurable success criteria, precise output.
It's not a coincidence. It's a feedback system. And feedback systems need clear signals.
Take your next feature. Before you start the agent, write down the success criteria. Not vague, not "it should work." Concrete. Measurable. Testable. Have the agent write the tests first. Then the implementation. Watch what happens when the agent can verify its own output.
The difference will be obvious. Not because you wrote a better prompt. Not because you used a better model.
But because you closed the loop.
You don't need a better model. You need a better loop.