
What Actually Worked When I Let Multiple AI Agents Rebuild My App

Clean specs, small tasks, a unified preamble, and ruthless guardrails. The model mattered less than the system.


Nino Chavez

Product Architect at commerce.com

I refactored a React Native app to Next.js using multiple AI agents—Gemini CLI, Claude Code CLI, Codex CLI, Kilocode, and Copilot. I shipped features and learned where each agent cracks.

The win for solo dev: clean specs, small tasks, a unified preamble, ruthless guardrails. The model mattered less than the system.

The False Start

I started with Aegis—a governance framework with stage gates, audit trails, and policy-driven agent execution. It still matters at enterprise scale for compliance, traceability, and risk controls. But for solo work, it slowed me down. I didn’t yet have a repeatable way to feed agents clean, scoped tasks.

For solo dev, governance can wait. Specs and scope control cannot.

What Actually Worked

I stopped chasing “best model” and standardized how I work:

  • Specs first. One-pager plus a task card with acceptance criteria (sketch below)
  • Bite-sized tasks. If it reads like an epic, split it
  • PR-first outputs. Branch, test, open PR—never silent local edits
  • AGENTS.md. A repo-level file that all tools read as the prompt preamble
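
A task card can be as small as this. The feature, fields, and commands below are illustrative, not a fixed template:

```markdown
## Task: Port the password-reset flow to Next.js

**Context:** one paragraph linking the relevant RN screens and API routes.

**Acceptance criteria**
- `/reset-password` renders as a server component; the form is a client island
- The existing session cookie still validates after a reset
- `npm run test:e2e -- reset-password` passes locally and in CI

**Out of scope:** styling polish, email templates

**Output:** a branch and a PR against `main`, following AGENTS.md
```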

The AGENTS.md file contains role and principles, coding standards, always-run commands, PR rules, escalation policy, and a “do not do” list. Every delegation says “Follow AGENTS.md,” so all agents inherit the same rules, vocabulary, and house style. No more prompt drift per tool.
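
A stripped-down sketch of the shape such a file can take (the section names and commands are illustrative):

```markdown
# AGENTS.md

## Role & principles
Senior Next.js engineer. Prefer boring, typed, tested code over clever code.

## Coding standards
TypeScript strict mode. App Router conventions. No new dependencies without asking.

## Always run before opening a PR
npm run lint && npm run typecheck && npm test

## PR rules
One capability per PR. Conventional-commit titles. Link the task card in the description.

## Escalation
Stop and ask if a change touches auth, payments, or data migrations.

## Do not
Edit generated files. Commit secrets. Push directly to main.
```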

Swapping tools became trivial. Quality didn’t nosedive just because I changed CLIs.

The Refactor

Scope: port the core flows, keep auth/session parity, and use React Server Components (RSC) where they're a win.

Agents did well on bulk file ops, routing scaffolds, codemods, and glue-code rewrites. Tests came out right when I pinned the runner and included worked examples in the spec. PRs had decent commit hygiene when I gave a template.
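
The codemod-style work, in particular, delegated well once the transform was spelled out. A minimal jscodeshift-style sketch of that kind of patterned rewrite (the shim module name is a made-up placeholder):

```typescript
// Illustrative transform, not the actual codemod from this refactor:
// rewrite imports of 'react-native' to point at a hypothetical web shim.
import type { API, FileInfo } from 'jscodeshift';

export default function transformer(file: FileInfo, api: API): string {
  const j = api.jscodeshift;
  const root = j(file.source);

  root
    .find(j.ImportDeclaration, { source: { value: 'react-native' } })
    .forEach((path) => {
      // '@app/web-primitives' stands in for whatever shim the port actually uses.
      path.node.source.value = '@app/web-primitives';
    });

  return root.toSource({ quote: 'single' });
}
```

Handing an agent a transform like this, plus the exact command to run it, kept the diff mechanical and easy to review.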

I still made the architecture calls (RSC vs. client), owned the auth boundaries and perf tradeoffs, and kept up the spec hygiene: a vague spec meant a mediocre PR.

Agent Scorecard

Each tool had different strengths. Copilot Chat was my default for PR-oriented work: it goes from VS Code to an open PR reliably and respects repo rules. Claude Code CLI shone on long-context refactors and big diffs. Gemini CLI was fast for utilities and test scaffolds. Codex CLI worked for surgical transforms in a tight loop. Kilocode handled patterned codemods across trees.

The throughline: outcomes mapped to spec clarity and AGENTS.md discipline, not model hype.

What Failed

My Jules run bombed because Node version, env vars, and test command weren’t pinned. The agent guessed; CI disagreed. That’s on me.
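
A pinned environment block in the task card (or in AGENTS.md) would have removed the guesswork. Something like this, with illustrative values:

```markdown
## Environment
- Node 20.x; `.nvmrc` is the source of truth
- Env vars: copy `.env.example`; never invent values
- Test command: `npm test` (CI runs exactly this, nothing else)
```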

“Do everything” tasks produced kitchen-sink PRs. Now: one visible change or one capability per task.

Vague acceptance criteria led to surface-level solves. Fix: tiny fixtures plus expected outputs in the spec.
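
In practice that means the spec ships with a runnable expectation instead of prose. A minimal sketch, where the function and values are made up for illustration:

```typescript
// "Tiny fixture + expected output" pinned in the spec.
import { expect, test } from 'vitest';

// Stand-in for the module under test; a real task card would import it instead.
function formatPrice(cents: number): string {
  return `$${(cents / 100).toFixed(2)}`;
}

test('spec fixture: 1999 cents formats as $19.99', () => {
  expect(formatPrice(1999)).toBe('$19.99');
});
```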

What I’m Still Figuring Out

The system works for solo development. I'm not sure how it scales to teams or to projects where multiple people write specs. The AGENTS.md file needs constant updating as patterns evolve. And cross-agent handoffs—Claude for the refactor, Copilot for PR polish, Gemini for tests—are still experimental.

But the core insight seems solid: the model matters less than the system you build around it.


Originally published on LinkedIn.
