The 48-Hour Artifact

I pointed an AI agent at a product initiative and didn't write a single line of code or a single word of copy. Two days later: 11 prototype pages, 4 strategic documents, cross-industry research with 30+ citations. And a trust problem I still can't solve.


Nino Chavez

Product Architect at commerce.com

Two weeks ago I ran an experiment. A product initiative needed strategic documents for leadership alignment and an interactive prototype to demonstrate the concept. The kind of work that normally takes a cross-functional team two to three weeks — strategy, research, design, development, validation.

I wrote prompts. Hundreds of them, over 48 hours. “Implement this plan.” “That reads like a blog.” “Is this actually true?” “What about utility billing portals?”

The AI did the rest. Every line of code, every word of copy, every design decision, every research finding. Two days later: 11 interactive prototype pages deployed to Vercel, an AI-powered support agent with 8 tools and 6 demo personas, 4 strategic documents validated against the production codebase, cross-industry research covering 14 platforms with 30+ cited sources, and a design system extracted from screenshots of the existing product.

I didn’t write a line of code. I didn’t write a word of document copy. I didn’t create a design spec.

So what did it need me for?


Hour 1: The Wrong Voice

The content generation pipeline I’d built has ghostwriter, copywriter, and editor stages. It produced documents that read like thought leadership pieces — narrative arcs, provisional hedging, rhetorical questions. Signal Dispatch voice.

The feedback I gave myself: these sound like blog posts, not strategy docs. An executive reading this wants conclusions first, evidence second, tables for data, bullets for facts. Not “here’s where I’ve landed.”

Three prompt changes fixed the surface problem. But the deeper issue was that the pipeline assumed a single voice. Internal strategy documents need the opposite of public writing — directness over narrative, assertion over exploration.

That realization didn’t come from the AI. It came from years of knowing what lands in a boardroom versus what lands on a blog.


Hour 3: The Wrong Components

The prototype looked impressive. Colored alert banners, progress bars, type tags, inline explainer boxes with monospace calculation blocks.

None of these existed in the actual product.

I’d fed the agent screenshots of the existing UI. It used them as inspiration instead of constraint. The existing product uses white cards, simple tables, green badges, blue links, and label-value pairs. No colored alerts. No progress bars. No explainer boxes.

This led to the first design principle I had to enforce: match the existing product. If a component doesn’t exist today, mark it as PROPOSED. Don’t present aspirational UI as buildable reality.

Stripping the prototype back to reality meant replacing every invented component with something that actually ships in the product’s design system. The agent could do the replacement once told. But it couldn’t tell the difference between invention and reality on its own.


Hour 5: The Wrong Words

“Surcharge.” “Non-preferred gateway.” “Downgrade.” “BLOCKED.” “You exceeded your cap.”

Every one of these words creates user anxiety. “Surcharge” sounds like a penalty. “Non-preferred” implies the user made a bad choice. “Downgrade” frames a cost-saving action as a demotion.

I asked: should the action buttons just say “Select” and let the price and position convey direction? The agent had been writing “Upgrade” and “Downgrade” because that’s what the domain language dictated. It took a human to notice that domain language and user-facing language serve different purposes.

That reframing triggered a broader audit: 120+ replacements across 16 files. “Surcharge” became “processing fee.” “Non-preferred” became “third-party.” “Cap” became “limit.” Every cost-related message was rewritten to lead with savings, not charges.

The agent executed the audit flawlessly once it had the rules. It just couldn’t generate the rules on its own.
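A minimal sketch of what that scripted audit looks like once the rules exist. The term pairs below are the examples from this post; the real audit covered 120+ replacements across 16 files, and the real table was larger.

```python
import re

# Illustrative subset of the terminology table — banned domain term
# mapped to its user-facing replacement.
REPLACEMENTS = {
    "surcharge": "processing fee",
    "non-preferred": "third-party",
    "cap": "limit",
}

def apply_terminology(text: str) -> str:
    """Swap banned domain terms for user-facing language."""
    for banned, preferred in REPLACEMENTS.items():
        # \b keeps "cap" from matching inside words like "capacity"
        text = re.sub(rf"\b{re.escape(banned)}\b", preferred, text,
                      flags=re.IGNORECASE)
    return text
```

The point of the sketch is that the hard part is the table, not the loop. Generating the left column and choosing the right column took human judgment; applying it is mechanical.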


Hour 8: The Research Nobody Planned

I asked: “Would it help to look at utility billing portals?”

This wasn’t in the original plan. But it produced some of the strongest evidence in the entire project:

  • Utility portals achieve 25-65% call center reduction with self-service billing
  • Only 17% of customers report “very good” bill understanding
  • A major telecom’s AI deployment cut call center costs by 90% with fine-tuned small models
  • Multiple cable companies have faced $10M-70M settlements for adding fees during fixed-rate contracts

Then I dropped 42 screenshots from SaaS products I use — Vercel, Resend, Supabase, Cloudflare. The agent extracted specific UI patterns: grouped invoice line items, billing tab navigation, upcoming invoice previews, plan selection cards.

None of these research spikes were planned. They came from me seeing the output and thinking “what else should we look at?” This is the 25-30% of content quality that the agent can’t produce by itself.


Hour 12: The Credibility Check

I asked the agent to validate its own claims against the production codebase and screenshots. It found problems:

  • “Self-service plan changes: No” was wrong — self-service upgrades already existed. Only downgrades were missing.
  • A feature framed as Phase 3 future work already had a working flow in production.
  • “Usage dashboard: No” was misleading — an existing page already showed trailing 12-month data.
  • The docs implied invoices lacked line item detail. They didn’t.

If a VP had checked any of these claims against the actual product, they’d have lost trust in the entire document package.

Then I ran the output through a second model for cross-validation. It found: one stat was 92%, not 88%. A claim about a company’s pricing change was more nuanced than stated. A data table had a logic gap — if 85% of cases were uncategorized, how could 59% be classified as a specific type?

The methodology answer resolved the gap. But only because I had access to the original email describing how the classification was actually done.


Hour 18: The Extraction

Everything above happened once. The question became: can it happen again for a different initiative?

I extracted the methodology into a reusable template:

  • Configurable CSS, JS components, and project instructions
  • Slash commands for each pipeline stage
  • Agent definitions for researcher, prototype-builder, doc-writer, and validator
  • A configuration file with execution depth, voice modes, and research scope
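To make the shape of that configuration concrete, here is a hypothetical version expressed as a Python dict. Every field name and value below is an assumption based on the description above, not the actual file.

```python
# Hypothetical pipeline configuration — field names and values are
# illustrative, inferred from the template description, not real.
PIPELINE_CONFIG = {
    "execution_depth": "full",          # e.g. "outline" | "draft" | "full"
    "voice_mode": "internal-strategy",  # vs. a public, narrative voice
    "research_scope": {
        "cross_industry": True,
        "min_cited_sources": 30,
    },
    "agents": ["researcher", "prototype-builder", "doc-writer", "validator"],
}
```

The design choice worth noting: voice and depth live in configuration, not in prose instructions, so a new initiative changes a few values instead of rewriting the prompts.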

Hour 24: The A/B Test

I started a new project from the template, pointed it at the same inputs, and ran it without the iterative human feedback loops.

Result: 70-80% of the quality in a tenth to a fifteenth of the time.

The first run failed on terminology (used “deflectable” instead of “resolvable without support”), didn’t flag the methodology gap, and had no source URLs. Six template fixes later — a populated terminology table, citation requirements, methodology questioning in the quality audit — the re-test passed every check.

The agent used correct terminology on the first draft, flagged the data contradiction, cited 19 URLs, and spontaneously self-reviewed its output against the banned words list.

When the terminology table had specific banned words, the agent spontaneously searched its own output and fixed violations. Vague rules like “use appropriate terminology” triggered nothing.

Two more fixes surfaced during prototype and doc stages: the doc-writer still generated metadata headers despite being told not to (needed an explicit ban in the agent definition, not just the project instructions), and the strategy doc buried the lede (needed a “first sentence of the document” rule, not just “first sentence of each section”).


What I Actually Learned

1. Rules must be explicit and repeated.

“No metadata fluff” in the project instructions wasn’t enough. The agent still generated Date/Audience/Status headers. The rule had to appear in both the project instructions and the agent definition with specific banned patterns. Vague rules don’t trigger self-correction. Specific rules with examples do.

2. Self-review emerges from specificity.

This was the surprise. I expected to need a separate validation pass for everything. Instead, when the terminology table listed specific banned words and their replacements, the agent searched its own output without being asked. The pattern: explicit constraints create implicit behaviors. Abstract constraints create nothing.
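The difference between the two kinds of rules is that one is mechanically checkable and the other isn’t. As an illustrative sketch (terms are examples from this post, not the full table), a concrete banned-terms list supports exactly the kind of self-scan the agent performed; “use appropriate terminology” gives it nothing to search for:

```python
import re

# Concrete, checkable rule: specific banned terms with replacements.
# A vague rule like "use appropriate terminology" cannot drive this scan.
BANNED = {
    "surcharge": "processing fee",
    "non-preferred": "third-party",
    "downgrade": "select",
}

def find_violations(text: str) -> list[tuple[str, str]]:
    """Return (banned term, suggested replacement) pairs found in text."""
    hits = []
    for banned, preferred in BANNED.items():
        if re.search(rf"\b{re.escape(banned)}\b", text, flags=re.IGNORECASE):
            hits.append((banned, preferred))
    return hits
```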

3. The agent doesn’t question its own data.

Contradictory statistics coexisted in the same table. The agent only flagged the issue when the quality audit explicitly said to check for logic gaps between adjacent data points. This isn’t a capability limitation — it’s an attention limitation. The agent can reason about contradictions. It just doesn’t look for them unless told to.

4. Visual design requires human eyes.

The agent matched the existing design system from screenshots and CSS. Every visual design issue — invented components, competing alert styles, wrong button semantics — was caught by me looking at the prototype in a browser. The agent can implement design. It can’t evaluate whether the result feels right in context.

5. Prototype and docs must be simultaneous.

Building the prototype tests design decisions. Writing the documents captures rationale. Building one after the other means the second is always outdated. The strongest output came from running both workstreams in parallel, with each informing the other.


The Trust Problem

I didn’t write a word of this project.

Not the prototype pages. Not the strategic documents. Not the cross-industry research. Not the codebase validation. Not the design system documentation. Not the terminology rules. Not the methodology template. Not the A/B test.

I wrote prompts. “Implement this plan.” “That reads like a blog.” “Should buttons say Select instead of Upgrade?” “Look at utility billing portals.” “Is this actually true?”

Everything else was produced by the agent.

So how does anyone trust it?


The Provenance Problem

The work product exists. The prototype is deployed. The documents are written. The research cites sources. The claims are validated against source code.

But the thinking — the reasoning that connected research to design decisions to copy choices to architectural recommendations — is buried in a prompt session in a terminal.

There’s no design file a designer made. No code a developer committed with their reasoning in the PR description. No document a strategist drafted with tracked changes. No research report an analyst compiled with a methodology section.

There’s a conversation. Thousands of messages between a human and an AI. The final artifacts look like they were produced by a team of specialists over weeks. They were produced by one person and one model over two days.

The strategy panels on each prototype page are one answer — they explain why each design decision was made, with citations. But they don’t show the five wrong versions that came before.

The A/B test log is another answer — it shows exactly what the agent got right and wrong, what I corrected, and what template fixes were needed. But it only exists because I decided to test the methodology.

The git history is a third answer. Every commit is a checkpoint. But the commits don’t capture the prompt that triggered the change.


What “Showing Your Work” Means Now

The outputs are verifiable. Every claim cites a source. Every codebase reference can be checked. Every competitive assertion can be confirmed. Every prototype page can be compared against the actual product.

What can’t be verified is the reasoning path. Why this framing instead of that one. Why this research direction instead of another. Why this terminology instead of the alternative.

That reasoning lives in the prompt session, which is ephemeral by default.

I don’t have a complete answer. But I have a starting point: make the methodology explicit, make the design decisions visible, make the claims verifiable, and make the process reproducible.

Trust the outputs by validating them. Trust the process by testing it. And recognize that the human contribution — the prompts, the judgment, the “what about X?” moments — is what separates a useful product initiative from a well-structured pile of generated text.

The 48-hour artifact is real. The question it raises — what does authorship mean when the work and the thinking are buried in prompt sessions — is one I’m still sitting with.
