AI × CEO

The software that fixes itself

EVAN REISER / MAY 4, 2026 / 6 MIN READ

The system is large, it changes every day, and none of its code was written by hand. Something that big should drown in its own bug count by month two. It hasn't, and the reason is a loop that finds, fixes, and learns from its own bugs.

In previous posts, I've described a system that does more and more of my job. It decides what my inbox needs and drafts the replies, it preps every meeting before I walk in, and it picks the direction of a reply before I do. Each feature also made the thing bigger, with more surface area, more moving parts, and more places to break without me noticing.

The system is large, it changes every day, and none of its code was written by hand. Every line came out of Claude Code, built by one person with a fleet of AI agents over a few months and no QA team behind any of it. Something that big, getting more complex by the week, should drown in its own bug count by month two.

It hasn't, and the reason is the self-improving loop I keep coming back to in this series, now turned on the system's own code instead of my inbox.

How that loop works is most of what follows. But it runs on bugs, and a codebase changing this fast produces them constantly, so it starts with the one thing it needs from me: a dead-simple way to flag a problem the moment I see it. From there the system carries each bug the rest of the way on its own, which is what lets one person keep something this large alive.

Reporting a bug is one keystroke

Hit one key anywhere in the app and the screenshot is already captured. One line on what's wrong, submit, and you're back to work.

When I'm using the dashboard and something looks wrong, I press one key. A box appears with a screenshot already taken and a single field that asks what's wrong. I type one line, hit submit, and I'm back to what I was doing. The whole thing takes about a second, and it never pulls me out of my own workflow to go file a ticket somewhere else.

It looks like a screenshot, but it captures far more than the pixels. The browser ships the entire page back to the server: the live document state, the scroll position, and the last twenty errors the page logged to the console. From that, the server rebuilds the page closely enough to debug from. So the report doesn't just say "this looks broken." It carries the real state of the page and the actual errors that fired, which is most of what a human debugger would spend the first ten minutes collecting by hand. The cost of telling the system something is wrong dropped to near zero, which means I report ten times more than I ever would have with a real ticketing tool. That volume turns out to be the fuel for everything downstream.
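
To make the shape of that report concrete, here is a minimal Python sketch of the server-side payload. The names (ConsoleErrorBuffer, BugReport, capture_report) and the field layout are assumptions for illustration, not the system's actual code. The only detail taken from the post is that the last twenty console errors travel with the page state and scroll position.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_CONSOLE_ERRORS = 20  # the post says the last twenty console errors are kept


class ConsoleErrorBuffer:
    """Keeps only the most recent console errors; older ones fall off the front."""

    def __init__(self, limit: int = MAX_CONSOLE_ERRORS) -> None:
        self._errors = deque(maxlen=limit)

    def record(self, message: str) -> None:
        self._errors.append(message)

    def snapshot(self) -> list:
        # Copy, so the report is not affected by errors logged later.
        return list(self._errors)


@dataclass(frozen=True)
class BugReport:
    """Hypothetical report payload: the one-line note plus the page state."""
    note: str
    document_html: str
    scroll_y: int
    console_errors: list = field(default_factory=list)


def capture_report(note: str, document_html: str, scroll_y: int,
                   errors: ConsoleErrorBuffer) -> BugReport:
    """Bundle the user's one-line note with everything needed to rebuild the page."""
    return BugReport(
        note=note.strip(),
        document_html=document_html,
        scroll_y=scroll_y,
        console_errors=errors.snapshot(),
    )
```

The bounded deque is the design choice worth noting: the page can log errors all day, and the report still carries only the most recent twenty.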

The board drains itself

Every bug lands here and the cards slide from New to In Progress to Fixed on their own. Anything the loop can't close moves to Needs Evan with a diagnosis and a failing test attached.

Every bug lands on the bug board. New on the left, then In Progress, then Fixed. Every ten minutes a fixer wakes up, looks for anything in the New column, and goes to work. Roughly seventy percent of bugs get resolved with no human ever looking at the code, mostly the small rendering and state bugs a fast codebase throws off all day. I watch the cards slide left to right on their own while I do something else.
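
One pass of that fixer can be sketched in a few lines of Python. This is an illustration under stated assumptions: Card, Column, and fixer_tick are hypothetical names, and try_fix stands in for the whole AI fixing step, returning True when the bug was closed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

POLL_INTERVAL_SECONDS = 10 * 60  # "every ten minutes", per the post


class Column(Enum):
    NEW = "New"
    IN_PROGRESS = "In Progress"
    FIXED = "Fixed"
    NEEDS_EVAN = "Needs Evan"


@dataclass
class Card:
    bug_id: int
    summary: str
    column: Column = Column.NEW


def fixer_tick(board: List[Card], try_fix: Callable[[Card], bool]) -> None:
    """One wake-up of the fixer: work every card currently in New."""
    for card in board:
        if card.column is not Column.NEW:
            continue  # already being handled, fixed, or escalated
        card.column = Column.IN_PROGRESS
        # try_fix is a stand-in for the agent; False means it could not close the bug.
        card.column = Column.FIXED if try_fix(card) else Column.NEEDS_EVAN
```

A scheduler (cron, a background thread, anything similar) would call fixer_tick once every POLL_INTERVAL_SECONDS. Cards the fixer cannot close move to Needs Evan instead of sitting in New forever.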

The reason it works is a protocol that took me a few tries to get right. The naive version, "here's a bug, go fix it," landed around a twenty percent success rate and lied to me constantly about the rest. Usually it would invent a plausible problem, fix that, and never touch the one I flagged. The version running now forces a fixed sequence. First the agent has to reproduce the bug by writing a test that fails, which proves the bug is actually real before a single line of code gets touched. Then it diagnoses the root cause and has to commit to a verdict, real bug or false alarm. Then it fixes the bug and reruns the exact same test, which now has to pass.
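
The sequence above can be sketched as a small Python function. It is a simplified illustration, not the running system: run_protocol, FixAttempt, and Verdict are hypothetical names, and the three callables stand in for work the agent does. The one property it is meant to show is that the same test must fail before the fix and pass after it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List, Optional


class Verdict(Enum):
    REAL_BUG = "real bug"
    FALSE_ALARM = "false alarm"


@dataclass
class FixAttempt:
    reproduced: bool = False
    verdict: Optional[Verdict] = None
    verified: bool = False
    log: List[str] = field(default_factory=list)


def run_protocol(repro_test: Callable[[], bool],
                 diagnose: Callable[[], Verdict],
                 apply_fix: Callable[[], None]) -> FixAttempt:
    """repro_test returns True when the test passes, False when it fails."""
    attempt = FixAttempt()

    # Step 1: the test has to FAIL before any code is touched.
    if repro_test():
        attempt.log.append("test passed before any fix; bug not reproduced")
        return attempt
    attempt.reproduced = True

    # Step 2: diagnose and commit to a verdict.
    attempt.verdict = diagnose()
    attempt.log.append(f"verdict: {attempt.verdict.value}")
    if attempt.verdict is Verdict.FALSE_ALARM:
        return attempt

    # Step 3: fix, then rerun the exact same test, which now has to pass.
    apply_fix()
    attempt.verified = repro_test()
    attempt.log.append("verified" if attempt.verified else "fix did not make the test pass")
    return attempt
```

Because the fix is only judged by the test written in step 1, an agent that fixes some other, invented problem ends with verified set to False.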

The last step is a second, independent AI agent that reviews the fix and is allowed to reject it. I learned the hard way that the agent that wrote the fix cannot be the agent that checks it. By the time a session has fixed a bug, it has spent its entire context convincing itself the fix is correct, so asking it to audit its own work is theater. When I tried having the fixing agent review its own work, it approved its own fixes ninety-eight percent of the time. A separate reviewer, with none of that history in its head, rejects about fifteen percent of fixes, and almost every rejection is the same sin: the fix wrapped the bug in error handling to make the symptom disappear instead of fixing the cause. The auditor has to be a stranger to the work. That single structural rule, separate the writer from the reviewer, did more for quality than any amount of better prompting.

And the protocol is enforced in code, not in a prompt's good intentions. A bug physically cannot be marked fixed unless the verification step ran and passed, and the agent has no way to fake that or route around it.
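
What "enforced in code" could look like, as a minimal Python sketch: the Fixed status is only reachable through a guard that checks a verification record. BugTracker and VerificationRequired are hypothetical names. In a real deployment the guard would presumably live in a service the agent cannot edit, since Python's private attributes are a convention and not a security boundary.

```python
from typing import Callable, Dict, Set


class VerificationRequired(Exception):
    """Raised when something tries to mark a bug fixed without a passing verification."""


class BugTracker:
    def __init__(self) -> None:
        self._status: Dict[int, str] = {}
        self._verified: Set[int] = set()

    def file(self, bug_id: int) -> None:
        self._status[bug_id] = "new"

    def run_verification(self, bug_id: int, repro_test: Callable[[], bool]) -> bool:
        """Run the reproduction test; only a pass records the bug as verified."""
        passed = bool(repro_test())
        if passed:
            self._verified.add(bug_id)
        else:
            self._verified.discard(bug_id)  # a later failure revokes an earlier pass
        return passed

    def mark_fixed(self, bug_id: int) -> None:
        if bug_id not in self._verified:
            raise VerificationRequired(f"bug {bug_id} has no passing verification")
        self._status[bug_id] = "fixed"

    def status(self, bug_id: int) -> str:
        return self._status[bug_id]
```

The tracker has no method that sets the status to fixed directly, so skipping verification produces an exception, not a closed card.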

A robot that goes looking for trouble

Filing bugs by hand only catches what I happen to notice, so a second agent goes looking for the ones I don't. Every hour it walks through every page of the dashboard, measures how fast each one loads, watches the console for errors, and looks at the page the way a person would, to catch the breakage that loads fine and logs nothing but still looks wrong. Whatever it finds, it files onto the same board the fixer drains. So the board fills from two directions, the bugs I catch and the ones it turns up on its own.

It took some tuning to get there, because the early versions were too trigger-happy. One crawled the dashboard during a restart, saw every page come back empty, and filed seventy-seven phantom bugs reporting that the whole app had vanished. Every mess like that becomes a guardrail, so the crawler keeps getting harder to fool. That one, for example, taught it to back off when too many pages look broken at once, on the assumption that the problem is the building and not the rooms.
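
That back-off rule might be sketched like this in Python. The names and the fifty percent threshold are assumptions for illustration; the post does not say what cutoff the crawler actually uses.

```python
from dataclasses import dataclass
from typing import List

MAX_BROKEN_FRACTION = 0.5  # assumed threshold, not a number from the post


@dataclass(frozen=True)
class PageCheck:
    url: str
    load_ms: float
    console_errors: int
    looks_empty: bool


def is_broken(check: PageCheck) -> bool:
    return check.looks_empty or check.console_errors > 0


def bugs_to_file(checks: List[PageCheck],
                 max_broken_fraction: float = MAX_BROKEN_FRACTION) -> List[str]:
    """Return the URLs worth filing, or nothing if the whole crawl looks suspect."""
    if not checks:
        return []
    broken = [c for c in checks if is_broken(c)]
    if len(broken) / len(checks) > max_broken_fraction:
        # Too many pages failed at once: assume the building, not the rooms.
        return []
    return [c.url for c in broken]
```

With this rule, a crawl that sees seventy-seven empty pages during a restart files nothing, while a crawl that sees one broken page out of many files exactly that one.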

When it gets stuck, it hands me a running start

The other thirty percent, the ones the fixer can't close on its own, don't just disappear. They move to a column called "Needs Evan." When a bug lands there it arrives with a written diagnosis, a test that reproduces it, and a log of everything already tried. The slow, boring part of any bug, reproducing it and finding the root cause, is already done, so when I pick one up the fix usually takes minutes.

The fixes that need me aren't wasted either. Whenever a bug lands in "Needs Evan" and I solve it, what I learned gets folded into the playbook the fixer runs, so the next time that kind of bug shows up it has a better shot at closing it on its own. The system doesn't rewrite its own playbook unsupervised yet, so for now that update is mine to make by hand, and automating it is the next thing I want to hand off. Even so, every escalation leaves the next version a little harder to stump.

The flywheel

Put the pieces together and you get a flywheel. Easier reporting means more bugs surface, more bugs give the fixer more reps, and more reps make it better at the next one. The pile of small breakage that slows most codebases to a crawl never gets the chance to form here, because the loop clears it as fast as it shows up.

It's the same loop from the last post, only now it runs on the system's own code instead of my decisions, and the share of bugs that ever reach me keeps shrinking. That is how one person runs something this large with no team. Whether I'm at the keyboard or asleep, it just keeps fixing itself.

Next in AI × CEO: Putting a number on it

-Evan