The Gate Verifies the Work. It Never Looks at the Plan.
Thirteen features shipped green. The designs were all wrong. The system made being wrong cheap—but only on one side of the work.
Nino Chavez
Product Architect at commerce.com
A clean run should feel like a win. Thirteen features built one after another, every test green, all of it shipped. The process worked.
So why did it leave me uneasy?
Because the designs were wrong. All of them. And the only reason that didn’t matter is that the system is built to make wrong cheap—but only on one side.
Let me show the work.
What “Done” Stops Being
The build process I run has one good idea at its center. “Done” isn’t a status someone sets. It’s a test that passes against the real database.
Not a checkbox. Not a tag on a file. Not an agent reporting success. A behavioral test, run against a real schema, green.
That sounds small. It isn’t. It removes the entire category of argument that usually eats engineering time: the “is this actually finished?” conversation. You don’t debate it. You write the test and the machine answers. Green means verified. Red means it’s broken, or it was never built. There’s no third state where someone’s confidence stands in for evidence.
Across thirteen features, that property held the line. The work stayed honest because nothing was allowed to claim done. It had to demonstrate done.
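To make "done is a passing test" concrete, here is a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the real one, and the table, column, and function names are all hypothetical, since the actual schema isn't shown here. The point is only the shape: the gate has two outcomes, and neither can be set by hand.

```python
import sqlite3

def create_schema(conn):
    # Hypothetical schema, standing in for the project's real one.
    conn.execute(
        "CREATE TABLE subscriptions ("
        "id INTEGER PRIMARY KEY, "
        "subtotal_cents INTEGER NOT NULL, "
        "tax_cents INTEGER NOT NULL DEFAULT 0)"
    )

def recalculate_tax(conn, subscription_id, rate_bps):
    # Feature under test: recompute tax at renewal (rate in basis points).
    row = conn.execute(
        "SELECT subtotal_cents FROM subscriptions WHERE id = ?",
        (subscription_id,),
    ).fetchone()
    tax_cents = row[0] * rate_bps // 10_000
    conn.execute(
        "UPDATE subscriptions SET tax_cents = ? WHERE id = ?",
        (tax_cents, subscription_id),
    )

def test_tax_recalculated_at_renewal():
    # Behavioral test: exercise the feature, then read the stored result back.
    conn = sqlite3.connect(":memory:")
    create_schema(conn)
    conn.execute("INSERT INTO subscriptions (id, subtotal_cents) VALUES (1, 10000)")
    recalculate_tax(conn, 1, rate_bps=825)
    stored = conn.execute(
        "SELECT tax_cents FROM subscriptions WHERE id = 1"
    ).fetchone()[0]
    assert stored == 825

def gate(test):
    # Two states only: the test ran clean, or it did not.
    try:
        test()
    except Exception:
        return "red"
    return "green"
```

Calling `gate(test_tax_recalculated_at_renewal)` returns `"green"` here. A feature that was never built raises on the missing function or table and comes back `"red"`, the same as a broken one.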
The Turn That Made Me Notice
Two features shipped in one stretch: recalculating tax at renewal, and recalculating shipping at renewal. Both went green. Both shipped. And inside that one stretch were two bugs.
The first one, the system caught.
A schema change forced me to rebuild a table, and the rebuild silently dropped two columns—the quiet kind of data loss where nothing nearby complains. What caught it was a rule: when you change a table, run every test that reads it, not just the ones you’re working on. Those tests went red—honestly red, the good kind—and the bug surfaced as a loud failure instead of a quiet corruption. (That rebuild is a whole story on its own.)
The second bug, the system could not catch.
The tax feature calls an external payments API. The design told me the request field was named one thing. It was named another. I only know because I checked the live API before I wrote the code.
Here’s the part that matters: the test for that feature fakes the external call. On purpose. Hitting a third-party API inside an automated test is slow and flaky, so the test stubs the response. Which means the test cannot see a wrong field name. The feature passes green with the wrong field and breaks the instant it touches the real API.
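A small sketch of why the stub is blind. The endpoint path and both field names are hypothetical (the real ones aren't given here), and `StrictFakeApi` is only a stand-in for the behavior of a live API that rejects fields it doesn't recognize.

```python
from unittest.mock import Mock

def fetch_renewal_tax(client, subtotal_cents):
    # "amount_total" is the (hypothetical) field name the design claimed.
    response = client.post("/tax/calculate", json={"amount_total": subtotal_cents})
    return response["tax_cents"]

def stubbed_test_passes():
    # The stub returns a canned response no matter what payload it receives.
    client = Mock()
    client.post.return_value = {"tax_cents": 825}
    return fetch_renewal_tax(client, 10_000) == 825

class StrictFakeApi:
    """Stand-in for the live API: rejects request fields it does not know."""

    ACCEPTED_FIELDS = {"amount"}  # hypothetical real field name

    def post(self, path, json):
        unknown = set(json) - self.ACCEPTED_FIELDS
        if unknown:
            raise ValueError(f"unknown request fields: {sorted(unknown)}")
        return {"tax_cents": 825}
```

`stubbed_test_passes()` returns `True` even though the payload is wrong, while `fetch_renewal_tax(StrictFakeApi(), 10_000)` raises `ValueError`. The stub checks what the code does with a response. It says nothing about whether the request was one the real service would accept.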
Every Gate Faces the Code
Pull back and look at where the checks point. The test suite checks the code. The schema validator checks the code. The pre-merge bot checks the code. The whole apparatus is aimed downstream, at the thing I produced.
Nothing is aimed at the thing I started from.
The design, the plan I built against, got no scrutiny at all. And the design was generated by an AI, the same kind of system the gates exist to police. I trusted the plan and verified the output, which is backwards from where the risk actually lived.
Because the plan was wrong in the same ways every time. It cited stale migration numbers. It specified a database constraint the project had explicitly banned. It pointed at a shared file that doesn’t exist. It used a test helper that quietly breaks the moment you give it more than one record. None of that is exotic. It’s the predictable drift of a plan written against a codebase it doesn’t actually live in.
The gates caught the consequences of those errors—when they turned into bad code that failed a test. They never caught the errors themselves. The only thing standing between a wrong plan and a shipped bug was me, already knowing the codebase well enough to distrust the plan on sight.
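Several of those plan errors are claims about the codebase that could, in principle, be checked mechanically. The following is a rough sketch, not something I run today. It assumes the plan is plain text that quotes file paths in backticks and writes migration references as "migration N", it treats a cited number as stale when it is not newer than the latest migration in the repo, and the banned pattern is a placeholder rather than the project's actual rule.

```python
import re
from pathlib import Path

def lint_plan(plan_text, repo_root, banned_patterns, latest_migration):
    """Return a list of plan claims that disagree with the codebase."""
    problems = []

    # File paths the plan cites (assumed to be quoted in backticks).
    for path in re.findall(r"`([\w./-]+\.\w+)`", plan_text):
        if not (Path(repo_root) / path).exists():
            problems.append(f"missing file: {path}")

    # Constructs the project has banned (name -> regex; placeholders here).
    for name, pattern in banned_patterns.items():
        if re.search(pattern, plan_text, re.IGNORECASE):
            problems.append(f"banned construct: {name}")

    # Migration numbers that are already taken (assumed meaning of "stale").
    for number in re.findall(r"migration\s+(\d+)", plan_text, re.IGNORECASE):
        if int(number) <= latest_migration:
            problems.append(f"stale migration number: {number}")

    return problems
```

A check like this would not catch the test helper that breaks on more than one record, because that is a behavioral fact rather than a structural one. It only covers the claims a file listing and a text search can answer.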
What I’m Doing Differently
The fix isn’t to stop generating plans with AI. The plans were ninety percent right and saved an enormous amount of time. The fix is to stop treating the plan as authority and start treating it as a draft from a confident stranger.
Two concrete changes. If a feature talks to an external service, I verify the real interface against what the design claims before I write code—every time—because the test can’t. And the parts of the design most likely to be stale, anything that asserts a fact about the codebase’s own structure, get checked against the codebase instead of believed.
Both of those are still habits. That’s the uncomfortable part. They live in my head and a notes file, not in a gate. A faster pass, a less suspicious day, and the false-green ships.
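The first habit can at least be written down as a comparison, even if collecting the real field names is still a manual step. A minimal sketch, with hypothetical field names: one side is what the design claims the request looks like, the other is what I observed from the live API or its current documentation.

```python
def diff_request_fields(design_fields, live_fields):
    """Compare the fields the design claims against the fields the live API accepts."""
    design, live = set(design_fields), set(live_fields)
    return {
        "only_in_design": sorted(design - live),  # likely wrong or stale names
        "only_in_live": sorted(live - design),    # fields the design never mentioned
    }

def design_matches_live(design_fields, live_fields):
    # True only when both sides name exactly the same fields.
    diff = diff_request_fields(design_fields, live_fields)
    return not diff["only_in_design"] and not diff["only_in_live"]
```

For example, `diff_request_fields(["amount_total", "currency"], ["amount", "currency"])` reports `amount_total` as design-only and `amount` as live-only, which is the shape of the mismatch that the stubbed test could not see.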
The Lesson Is About Direction
“Trust, but verify” is the oldest advice in this work. What this run taught me is that I’d only built half of it. I verify output exhaustively. I verify input by remembering to.
The mechanical half is genuinely good. The advisory half is where every real bug this week actually lived.
So the question I’m holding isn’t whether to trust AI-generated plans. It’s narrower and harder: what would it take to point a check upstream—at the plan—instead of only ever at the work? The gate that catches a broken test is easy. The gate that catches a confidently wrong instruction, before I build on it, is the one I don’t have yet.