
What Actually Worked When I Let Multiple AI Agents Rebuild My App

Clean specs, small tasks, a unified preamble, and ruthless guardrails. The model mattered less than the system.


Nino Chavez

Product Architect at commerce.com

I refactored a React Native app to Next.js using multiple AI agents—Gemini CLI, Claude Code CLI, Codex CLI, Kilocode, and Copilot. I shipped features and learned where each agent cracks.

The win for solo dev: clean specs, small tasks, a unified preamble, ruthless guardrails. The model mattered less than the system.

The False Start

I started with Aegis—a governance framework with stage gates, audit trails, and policy-driven agent execution. It still matters at enterprise scale for compliance, traceability, and risk controls. But for solo work, it slowed me down. I didn’t yet have a repeatable way to feed agents clean, scoped tasks.

For solo dev, governance can wait. Specs and scope control cannot.

What Actually Worked

I stopped chasing “best model” and standardized how I work:

  • Specs first. One-pager plus a task card with acceptance criteria (sketch below)
  • Bite-sized tasks. If it reads like an epic, split it
  • PR-first outputs. Branch, test, open PR—never silent local edits
  • AGENTS.md. A repo-level file that all tools read as the prompt preamble
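
A task card can be as small as this. The feature, fields, and commands below are illustrative, not a fixed template:

```markdown
## Task: Port the password-reset flow to Next.js

**Context:** one paragraph linking the relevant RN screens and API routes.

**Acceptance criteria**
- `/reset-password` renders as a server component; the form is a client island
- The existing session cookie still validates after a reset
- `npm run test:e2e -- reset-password` passes locally and in CI

**Out of scope:** styling polish, email templates

**Output:** a branch and a PR against `main`, following AGENTS.md
```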

The AGENTS.md file contains role and principles, coding standards, always-run commands, PR rules, escalation policy, and a “do not do” list. Every delegation says “Follow AGENTS.md,” so all agents inherit the same rules, vocabulary, and house style. No more prompt drift per tool.
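
A stripped-down sketch of the shape such a file can take (the section names and commands are illustrative):

```markdown
# AGENTS.md

## Role & principles
Senior Next.js engineer. Prefer boring, typed, tested code over clever code.

## Coding standards
TypeScript strict mode. App Router conventions. No new dependencies without asking.

## Always run before opening a PR
npm run lint && npm run typecheck && npm test

## PR rules
One capability per PR. Conventional-commit titles. Link the task card in the description.

## Escalation
Stop and ask if a change touches auth, payments, or data migrations.

## Do not
Edit generated files. Commit secrets. Push directly to main.
```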

Swapping tools became trivial. Quality didn’t nosedive just because I changed CLIs.

The Refactor

Scope: port the core flows, keep auth/session parity, and use React Server Components (RSC) where they're a win.

Agents did well on bulk file ops, routing scaffolds, codemods, and glue-code rewrites. Tests came out right when I pinned the runner and included worked examples in the spec. PRs had decent commit hygiene when I gave a template.
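
The codemod-style work, in particular, delegated well once the transform was spelled out. A minimal jscodeshift-style sketch of that kind of patterned rewrite (the shim module name is a made-up placeholder):

```typescript
// Illustrative transform, not the actual codemod from this refactor:
// rewrite imports of 'react-native' to point at a hypothetical web shim.
import type { API, FileInfo } from 'jscodeshift';

export default function transformer(file: FileInfo, api: API): string {
  const j = api.jscodeshift;
  const root = j(file.source);

  root
    .find(j.ImportDeclaration, { source: { value: 'react-native' } })
    .forEach((path) => {
      // '@app/web-primitives' stands in for whatever shim the port actually uses.
      path.node.source.value = '@app/web-primitives';
    });

  return root.toSource({ quote: 'single' });
}
```

Handing an agent a transform like this, plus the exact command to run it, kept the diff mechanical and easy to review.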

I still made the architecture calls (RSC vs. client), owned the auth boundaries and perf tradeoffs, and kept up the spec hygiene: a vague spec meant a mediocre PR.

Agent Scorecard

Each tool had different strengths. Copilot Chat was my default for PR-oriented work: it goes from VS Code to an open PR reliably and respects repo rules. Claude Code CLI shone on long-context refactors and big diffs. Gemini CLI was fast for utilities and test scaffolds. Codex CLI worked for surgical transforms in a tight loop. Kilocode handled patterned codemods across trees.

The throughline: outcomes mapped to spec clarity and AGENTS.md discipline, not model hype.

What Failed

My Jules run bombed because Node version, env vars, and test command weren’t pinned. The agent guessed; CI disagreed. That’s on me.
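
A pinned environment block in the task card (or in AGENTS.md) would have removed the guesswork. Something like this, with illustrative values:

```markdown
## Environment
- Node 20.x; `.nvmrc` is the source of truth
- Env vars: copy `.env.example`; never invent values
- Test command: `npm test` (CI runs exactly this, nothing else)
```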

“Do everything” tasks produced kitchen-sink PRs. Now: one visible change or one capability per task.

Vague acceptance criteria led to surface-level solves. Fix: tiny fixtures plus expected outputs in the spec.
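
In practice that means the spec ships with a runnable expectation instead of prose. A minimal sketch, where the function and values are made up for illustration:

```typescript
// "Tiny fixture + expected output" pinned in the spec.
import { expect, test } from 'vitest';

// Stand-in for the module under test; a real task card would import it instead.
function formatPrice(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

test('spec fixture: 1999 cents formats as $19.99', () => {
  expect(formatPrice(1999)).toBe('$19.99');
});
```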

What I’m Still Figuring Out

The system works for solo development. I'm not sure how it scales to teams or to projects where multiple people write specs. The AGENTS.md file needs constant updating as patterns evolve. And cross-agent handoffs—Claude for the refactor, Copilot for PR polish, Gemini for tests—are still experimental.

But the core insight seems solid: the model matters less than the system you build around it.


Originally published on LinkedIn.
