Back to all essays
The Only Thing in My Codebase That Can't Lie
Engineering 10 min read

The Only Thing in My Codebase That Can't Lie

I asked an agent a simple product question and watched it spend three thousand words rediscovering things it had no way to trust. The fix wasn't a better map. It was noticing which artifacts in a codebase can lie to you — and which one can't.

NC

Nino Chavez

Product Architect at commerce.com

The failing tests said mobile. A handful of red checks, all tagged to the phone viewport, all complaining that some element wasn’t on the page.

So I had the agent run them on desktop too. They failed there as well.

That single fact changed the question. These weren’t mobile bugs. The element the tests were looking for — a pricing label, a testimonial block — wasn’t missing because the layout broke on a small screen. It was missing because someone had redesigned the page, and the tests still expected the old one.

Which means the real question was never “how do I make this green.” It was: did the test drift away from a product that deliberately changed, or did the product actually break?

You cannot answer that by reading the code. Both look identical from the code’s side — an assertion that no longer matches what’s rendered. To know which one is wrong, you have to know what someone meant. Whether the testimonial was deleted on purpose.

That’s the one thing you can’t grep for.


Two Sources of Truth

I want to name the thing that test was sitting on top of, because it turned out to be everywhere.

A codebase carries two different kinds of truth.

There’s what the code does — recoverable any time. You can read it, run it, grep it. It’s never out of date with itself.

And there’s what you meant — the intent, the why, the deliberate choice behind the shape. That one isn’t in the code. It lives in a person’s head, or in some artifact a person wrote down once and then walked away from.

These two drift apart continuously, and nothing forces them back together. The drifted test was just the first place I watched the gap show.

It was not the last.


The Same Gap, Four More Times

A few hours into the same session, the unit-test suite was passing. Nine hundred and seventy tests, all green, build clean. And every authenticated page in the app crashed on load — a component caught in an infinite update loop, shipped because it had passed the gate.

The green checkmark was telling the truth about what it measured. What it measured just wasn’t whether the product worked.

Then the database. Its change history was a list of files, each one marked “applied.” Several contained instructions that could never have run as written — they described changes to tables that weren’t there. The record said the schema was one thing. The running database was another. That record had been confidently wrong for who knows how long, and nothing had noticed, because nothing re-checked.

Then the brand. Four separate documents described the product’s design language — a context file, a design doc, a style guide, the guide’s own labels. Each had frozen at a different era of a rebrand the product had been through more than once. Exactly one artifact tracked what the product actually looked like: the stylesheet, because the stylesheet is the thing that renders.

When I asked the agent for a design recommendation, it reached for the most authoritative-looking document and proposed bringing back a color we’d abandoned two rebrands ago. It wasn’t wrong about what the document said. The document was wrong.

And then the one that should have been the safeguard. A visual audit — the tool whose entire job is to catch exactly this kind of drift — came back green. It had only ever looked at one page, a component gallery, and never walked the actual product. Zero of the admin screens appeared in its report.

The instrument built to detect lying artifacts was itself a lying artifact.

Four artifacts. Four different corners of one system. One disease.

Each was a frozen assertion — true at the moment it was written, trusted long after, re-checked by nothing.


The Membrane

There was one artifact in the whole session that didn’t have this disease. The test that started it.

A test is the only thing in a codebase that states what you meant and then proves it against what the code does — every single time it runs. A document asserts once and freezes. A test asserts and keeps proving.

A document states the truth once and freezes. A test states it and keeps proving it. That’s the difference between a record and a membrane.

That’s why the drifted test felt different from the lying brand doc, even though both were technically “out of date.” The test announced its disagreement the instant it appeared. It failed loudly.

The brand doc, the migration record, the green audit — they kept their disagreement to themselves and let me go on trusting them.

A document that lies silently is worse than no document at all — because you act on it.


So Build a Better Map?

The obvious response, after an agent spends three thousand words re-deriving how an app fits together, is to build it a map. An atlas of the system it can read instead of rediscovering every session.

I argued the agent out of it.

A structural map is the wrong thing to maintain. The agent is already good at structure — it can grep, it can trace, it can reconstruct where things live in seconds, and that reconstruction is never stale. A hand-maintained map of the same territory is strictly worse, because the moment it drifts it doesn’t just go useless. It becomes trusted and wrong.

The most dangerous artifact in a codebase isn’t the one that’s missing. It’s the one that’s confidently out of date and looks authoritative.

What grep can’t surface is the other category — the landmines. The column that’s written but never read. The guard that has to run in a specific order or everything breaks silently. The behavior that’s true only because someone already stepped on it once. Those don’t live in the structure. They live in whoever stepped on them first — and wrote nothing down.

The checks that block bad output. The joins that show where independent systems disagree. The traces that tie code back to the intent it was supposed to serve — the apparatus I’ve called scaffolding before. What I didn’t have a clean way to say then is why it’s the work: those are the artifacts that can’t lie. Everything else in the repo is a claim. These carry their own proof.


I Sold the Wrong Number

I built the registry. Then I made the mistake of testing my own claim.

I’d predicted what it would buy: a fresh session reads two entries, skips the rediscovery, reaches the same answer in a fraction of the work. So I opened a clean session, gave it nothing but the original question, and watched.

Three things happened. Only two were what I’d sold.

The wiring fired. The cold session read the registry first, unprompted, before it touched the code.

The answer got better. Its core argument came back as a direct citation of a registry entry — tighter and more sure of itself than the one I’d reasoned my way to the first time, and it arrived in a single turn.

And it cost exactly as many tokens as before. About a hundred and twelve thousand, either way. After reading the registry, the agent went and verified the whole subsystem against the source anyway.

I’d sold the wrong number. The registry’s win was never speed.

The reason it wasn’t is the entire point of everything above. An agent that reads a document and trusts it instead of checking is the exact failure this post is about. The re-verification I’d filed in my head as waste — the rediscovery tax I wanted to eliminate — was the agent refusing to do the one thing that makes documents dangerous.

You don’t want it to skip the check.

You want to make the check cheap. Confirm the handful of artifacts a claim points at, instead of re-deriving the whole system from scratch. That’s a real efficiency lever. “Trust the doc, skip the verification” is not. That’s the bug wearing a feature’s clothes.

So the honest version of the claim is narrower than the one I started with: a better answer, reached with more confidence, in one turn. Not fewer tokens. I’ve measured the quality. The cost-collapse version — verification made genuinely cheap — I haven’t earned yet.


Where It Stops Being Clean

So the move looks obvious: make every claim carry a verifier. A fact about what the code does, with no test proving it, isn’t a record — it’s a liability. Promote it to a test, or stop trusting it.

Except it doesn’t close all the way, and I want to be honest about where it frays.

Not everything can be verified against the code. Why we chose a three-way setting instead of a simple on/off — that isn’t a fact about what the code does. It’s a fact about a decision someone made on a particular day. You can’t write a test that proves a rationale. There’s no assertion for “we meant it this way.” That kind of truth can only be dated and appended to, never re-checked, because its source isn’t the code. It’s history. And you don’t re-run history.

And the registry that came out of that session — a list of the landmines — doesn’t fully obey its own rule yet. Some entries are tests that can’t lie. Others are still prose pointing at other prose, inheriting the exact drift they were built to kill. I found the pattern. The tooling is still catching up to it.

There’s a larger version of this I keep circling and haven’t landed.

If the real source of intent isn’t any of these artifacts but the conversations that produced them — the sessions where I tell an agent what I mean and it generates the code, the tests, the docs in response — then the whole chain hangs off something that was never written down to begin with. Every artifact downstream is a generated assertion about an intent that lives only in a transcript nobody re-reads.

I don’t know yet whether that chain can be kept honest, or whether the un-verifiable part — the why — just rots no matter how carefully you date it.

What I’m sure of is the smaller claim. Most of what a codebase tells you about itself is a frozen assertion, decaying from the moment it was written. Find the few artifacts that re-prove themselves on every run. Trust those. Treat the rest as exactly what they are.

The part that loops back on itself: this session — the bug, the registry, the number I got wrong — is the raw material. And what it produced is a system for catching exactly this kind of knowledge before the conversation ends and it’s gone. The post makes its own case. The durable artifact was never the code. It’s the landmines and the why — what a smarter model still can’t recover once the transcript scrolls away.

The green checkmark was never the problem.

Believing it was.

Share:

More in Engineering