Putting a number on it
For most of the time I spent building the AI that does my job, I had no way to put a number on how well it was working. So I built a page that tracks one number: the minutes of my time it saves. Every action priced, graded down by what I actually did with it.
For most of the time I've spent building the AI that does my job, I had no way to put a number on how well it was working. I could feel it saving me time, but I couldn't have told you how much, or which parts were earning their keep, or whether the feature I'd spent a week on was even worth building.
That's a dangerous thing not to know, because the features that feel valuable and the features that are valuable are not the same set. Something can look and feel great, get used every day, and save me almost nothing, while something quieter and less impressive hands back an hour a week, and from the inside the two are impossible to tell apart.
So I built a page that tracks one number: the minutes of my time the system has saved. Every action it takes gets priced in that single currency. A reply I'd have spent six minutes writing is worth six minutes, a meeting it kept off my calendar is worth however long that meeting would have run, and a research question I'd have chased across Salesforce and old email threads for twenty minutes is worth twenty.
Every one of those numbers is an estimate, part measurement and part guess. The system clocks the real things it can, such as how long a deflected meeting would have run and how much of a draft I actually rewrote, but the base values are my read on what the work costs by hand, so the total could be off by a fair bit. That matters less than you'd think: I measure the same way every week, so even when the absolute figure is off, the trend still tells me whether the system is improving.
Measuring it honestly
A number like this is only useful if it refuses to flatter me, and most dashboards flatter by design. So the price gets graded down by what I actually did with the output. A draft earns its full six minutes only if I send it untouched, earns most of them if I edit it lightly, and earns almost nothing if I rewrite half of it, because at that point the work was mine. Reject it outright and it goes negative, since a bad draft cost me the time to read it and gave me nothing back.
Whole categories earn zero on purpose. Sorting my inbox is worth nothing, because triage costs me about two seconds an email and crediting it would bury the real signal in noise. A meeting brief earns nothing until I can prove I actually opened it. I would rather the number sit too low and stay trustworthy than climb on work I can't show I used. The losses and the zeros are actually the most useful part of the page, because they show me exactly where the system is still pretending to help.
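The grading rules in this section can be sketched in a few lines of Python. This is a minimal illustration, not the page's actual code: the multipliers, the 50% rewrite cutoff, the one-minute penalty for a rejected draft, and names like `graded_minutes`, `inbox_triage`, and `meeting_brief` are all assumptions chosen only to match the ordering described above.

```python
# Hypothetical values: the post describes the ordering of outcomes,
# not the exact multipliers, so every constant here is an assumption.
LIGHT_EDIT_CREDIT = 0.75      # a lightly edited draft earns "most" of its value
HEAVY_EDIT_CREDIT = 0.125     # a heavily rewritten draft earns "almost nothing"
HEAVY_EDIT_CUTOFF = 0.5       # rewriting half or more counts as a heavy edit
REJECT_PENALTY_MINUTES = 1.0  # assumed time lost reading a bad draft

ZERO_CATEGORIES = {"inbox_triage"}  # categories that earn nothing on purpose


def graded_minutes(category, base_minutes, outcome,
                   fraction_rewritten=0.0, opened=True):
    """Return the minutes credited for one action after grading."""
    if category in ZERO_CATEGORIES:
        return 0.0
    if category == "meeting_brief" and not opened:
        return 0.0  # no proof the brief was opened, so no credit
    if outcome == "rejected":
        return -REJECT_PENALTY_MINUTES
    if outcome != "used":
        raise ValueError(f"unknown outcome: {outcome!r}")
    if fraction_rewritten == 0:
        return float(base_minutes)
    if fraction_rewritten < HEAVY_EDIT_CUTOFF:
        return base_minutes * LIGHT_EDIT_CREDIT
    return base_minutes * HEAVY_EDIT_CREDIT


print(graded_minutes("reply_draft", 6, "used"))                          # sent untouched
print(graded_minutes("reply_draft", 6, "used", fraction_rewritten=0.1))  # light edit
print(graded_minutes("reply_draft", 6, "used", fraction_rewritten=0.5))  # heavy rewrite
print(graded_minutes("reply_draft", 6, "rejected"))                      # goes negative
```

Keeping the zero-credit rules at the top of the function means a category can never earn credit by accident, which fits the preference for a number that sits too low over one that flatters.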
It turns into a game
I didn't build any of this to be fun, but a number that climbs pulls at the same part of my brain that wants to top a leaderboard. The difference from an ordinary game is that the points are real: each one is time I got back and spent on the business instead of on my inbox. Building it is a game and so is using it, and both move the company forward. The fun is looking at the number every day and working out how to beat it tomorrow.
Where the next fix comes from
Once every action is priced and graded, I can ask the number where my time is going. It shows me the senders whose drafts I keep rewriting and the categories where the gap between what a task should be worth and what it actually earns is widest. That gap is a ranked list of what to fix next, built from how I actually behave instead of from my opinion about what's broken. In a prior life, picking the next feature would have meant a planning meeting, settled by seniority and whoever argued hardest. It's just me now, and the score becomes my objective function, the thing I prioritize against when I decide what to build.
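As a rough sketch of that ranked list: sum, per category, the difference between what each action should have been worth and what it earned after grading, then sort by the largest gap. The function name `rank_gaps`, the tuple layout, and the sample figures are hypothetical, made up for illustration.

```python
from collections import defaultdict


def rank_gaps(actions):
    """Rank categories by total minutes lost between base value and earned value.

    `actions` is an iterable of (category, base_minutes, earned_minutes) tuples.
    This layout is an assumption for the sketch, not the page's real schema.
    """
    gaps = defaultdict(float)
    for category, base_minutes, earned_minutes in actions:
        gaps[category] += base_minutes - earned_minutes
    # Largest gap first: the top entry is the best candidate to fix next.
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)


# Made-up sample data, purely illustrative.
sample = [
    ("reply_draft", 6, 6),       # sent untouched, no gap
    ("reply_draft", 6, 0.5),     # heavily rewritten
    ("research", 20, 5),         # mostly redone by hand
    ("meeting_deflect", 30, 30), # full credit
]
for category, gap in rank_gaps(sample):
    print(f"{category}: {gap} minutes lost")
```

The same grouping works with sender as the key instead of category, which would surface the senders whose drafts keep getting rewritten.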
The system does the asking
Every day the system goes through its own scoreboard, finds the categories that are underperforming, and, where it has enough evidence to be sure, writes up the fix and leaves it on the page as a suggestion. I open the page and a short list of proposed improvements is already waiting, each one pointed at a real loss in the data, none of which I went looking for.
For now it stops at the suggestion. The system proposes, I approve, and the change gets built. Letting it close that loop on its own, finding the problem and shipping the fix with me out of the middle, is clearly where this goes. I'm holding it at semi-automatic until I trust the suggestions enough to stop reading every one, which is a call about risk rather than a limit of the technology. The machinery to make it automatic already exists.
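One way to picture that daily pass, as a sketch under stated assumptions: flag a category only when it has enough graded actions to count as evidence and earns well under its base value. The thresholds (`MIN_ACTIONS`, `MAX_CAPTURE_RATIO`), the `CategoryStats` shape, and the `"proposed"` status are all hypothetical. The post doesn't say how "enough evidence" is actually measured.

```python
from dataclasses import dataclass

MIN_ACTIONS = 20         # assumed evidence floor before anything is suggested
MAX_CAPTURE_RATIO = 0.5  # assumed: flag categories earning under half their base value


@dataclass
class CategoryStats:
    name: str
    actions: int           # number of graded actions in the category
    base_minutes: float    # what those actions should have been worth
    earned_minutes: float  # what they earned after grading


def propose_fixes(stats):
    """Return suggestions for underperforming categories, worst first."""
    suggestions = []
    for s in stats:
        if s.actions < MIN_ACTIONS or s.base_minutes <= 0:
            continue  # not enough evidence to be sure
        capture_ratio = s.earned_minutes / s.base_minutes
        if capture_ratio < MAX_CAPTURE_RATIO:
            suggestions.append({
                "category": s.name,
                "capture_ratio": capture_ratio,
                "status": "proposed",  # waits for a human to approve
            })
    return sorted(suggestions, key=lambda item: item["capture_ratio"])


# Made-up sample data, purely illustrative.
stats = [
    CategoryStats("reply_draft", 40, 240.0, 60.0),      # lots of data, low capture
    CategoryStats("research", 5, 100.0, 10.0),          # too few actions to judge
    CategoryStats("meeting_deflect", 30, 900.0, 850.0), # performing well
]
for suggestion in propose_fixes(stats):
    print(suggestion)
```

Every suggestion starts in a proposed state, so nothing ships without an explicit approval step.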
A few months ago I couldn't tell you which parts of this were worth anything. Now I can, and two things came out of it. The first is how wrong I'd been: the features I was proudest of turned out to be worth almost nothing while a few quiet ones were carrying me. The second is that putting a number on it made the work fun. Most games eat the time you put into them, but this one hands it back, and it only rewards the work that was actually worth doing.
Next in AI × CEO: The to-do list that maintains itself
-Evan