AI agents are shipping your code. Who's checking their work?
Each week leading up to AI Council (May 12–14, 2026), I'm digging into one of the tracks we’re excited to cover. This week: how to trust the quality of hundreds of AI coding agents.
For those of you who don’t know me, my name is Pete Soderling. I started AI Council (formerly Data Council) with one purpose: bring together the sharpest builders in the world to talk about what they’re actually experiencing in the field. No BS. No hype.
Each piece tees up a question we’re planning to go deep on. If something resonates — or you disagree — comment below. That’s the whole point!
How do we trust the work of hundreds of agents?
I’ve been a software engineer for a long time. I’ve worked on big teams and small ones, and I’ve watched the job change in ways that felt seismic at the time — the DBA becoming a relic, DevOps absorbing what used to be its own discipline, cloud eating everything that came before it. Each time, the tools and roles that felt permanent turned out to be temporary. I think we’re in one of those moments again, except the disruption this time feels deeper and faster than anything I’ve seen before.
What’s been on my mind lately is where the pressure actually comes from with vibe coding. The natural instinct is to look downstream — what breaks, what’s the blast radius, how do we manage quality when agents are delivering a firehose of code at a volume no human team could match? But what my conversation below with Eno Reyes, Co-founder and CTO of Factory, points to is that the disruption runs in both directions. To the left of the code, there’s a spec-writing and planning process that needs to be formalized — a precursor to high-quality vibe coding that can also be tapped post-production to make sure the spec is actually being honored.
Vibe coding is atomizing the toolchain in both directions, downstream and upstream of the coding itself. And that requires different processes that I’m not sure most teams have fully reckoned with yet.
Last week, I shared my conversation with Scott Breitenother about the mindset shift engineers need to make to work successfully with agents. This week is about what comes next — how do you actually manage quality at scale?
Eno Reyes is one of the builders closest to this problem. Factory’s Droids have run Missions lasting 14 days straight, with hundreds of agents executing continuously. He’s seen what breaks and what doesn’t. In our conversation, we get into what agent readiness requires, why the initial spec is more important than most teams realize, and why silent failure modes are the part nobody is ready for.
Q&A with Eno Reyes, Co-founder & CTO of Factory
Pete: How can the code quality problem be solved when thousands of lines of code are being written?
Eno: It’s a combination of finding the right platform and choosing to invest in it as a software development organization.
There's a certain amount of agent readiness work you can do — evolving the way you ship software to orient around what agents are actually capable of. In practice, that means introducing deterministic type checks, linters, and code formatters: a layer of structure that goes beyond what most software orgs have today.
From an agent management perspective, there are also best practices your team needs to adopt: spec-driven development and knowing when to fix the code yourself versus when to pull agents in.
The longest Mission we saw run was 14 days straight of continuous effort from hundreds of agents. When you’re shipping that volume of code, investing in automated QA and testing is really the only way to validate correctness at the frequency agents enable.
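The deterministic gate Eno describes can be sketched as a simple check runner that refuses to merge agent-written code until every check passes. The specific tool commands below are illustrative assumptions, not Factory's actual setup.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class Check:
    name: str
    cmd: list[str]  # e.g. ["mypy", "src/"] -- tool choice is illustrative

def run_gate(checks: list[Check], runner=subprocess.run) -> list[str]:
    """Run every deterministic check; return the names of the ones that failed.

    An empty list means the gate is green and agent-written code can merge.
    """
    failures = []
    for check in checks:
        result = runner(check.cmd, capture_output=True)
        if result.returncode != 0:
            failures.append(check.name)
    return failures
```

In CI, a non-empty result would fail the build; the same gate can run locally before an agent opens a PR.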
Pete: Is there an equivalent of a PR for specs?
Eno: We’re generally getting to a world where the specific lines of code are less relevant than the architecture of the code and the architecture of the change. Ensuring correctness is not done at the line level by humans anymore — it’s done by systems (like deterministic type checkers, formatters, linters, mentioned above) where you say the code must maintain this structure.
The spec matters, but only as a procedural input to the larger task. We don’t store the specs we use to ship a prototype, because the code really becomes the spec once it’s done. What’s necessary is some level of abstraction between the code in your codebase and what you as a human consume. Our auto-Wiki product does this — a continuous internal engineering overview that lets you view architecture and structure across systems. If you have specific questions about implementation, you can just ask Droid: how does this work?
Specs will matter, but only as an input. And validating a spec needs to happen one step before it’s generated — that’s about directional alignment across the team. Product should be involved. Engineering should be involved. The lines are blurring, and we don’t really have a system for that yet, other than whatever standards your company enforces around design docs.
Pete: What do you think is missing in order to make agentic coding trustworthy enough to deploy at scale?
Eno: From a technical perspective, all the pieces are there. The biggest barrier is winning trust from developers, and that requires them seeing the system used and seeing the controls in place. We saw this firsthand in deploying from zero to 10,000 engineers at one company in under three months.
So we did a major UX redesign of Missions where the execution was procedurally unchanged, but we changed how we presented the information. That alone significantly increased the autonomy developers were willing to grant the system. People were way more willing to hand over authority and control, entirely because of an interface change that made them feel more confident in the work being done.
It’s a combination of how you roll out and enable people alongside UX that grants visual transparency into what’s going on.
Pete: Your Droids can run on full autopilot, but do you see teams keeping a human in the loop at a certain part of the workflow?
Eno: It’s totally dependent on where Droids are slotted in. Most people are 100% okay running Droids in their CI/CD pipeline for code review and QA — full auto, because they control what it can and can’t do, they know the environment, they know how it’s locked down. On local laptops, there’s usually a different set of controls entirely.
We track what we call the autonomy ratio for all of our customers — and you can see it trending upward over time as trust builds. The ratio is typically around 17-19x, meaning for every 16-18 actions a Droid takes, there’s one human action involved: hitting escape, queuing a message, adjusting a setting.
At Factory, our internal coding autonomy ratio for our own development is close to 35x. So you could be running roughly twice as autonomously as the average customer today. We see a lot of room to grow.
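As a rough illustration, the metric Eno describes could be computed from an action log like this; the labels and the exact definition are my assumptions, not Factory's implementation.

```python
def autonomy_ratio(actions: list[str]) -> float:
    """Total actions per human intervention.

    `actions` is a log of "droid" / "human" labels. 16 droid actions plus
    1 human action gives a 17x ratio, matching the 17-19x range above.
    (Hypothetical reconstruction of the metric, not Factory's definition.)
    """
    human = actions.count("human")
    if human == 0:
        return float("inf")  # a fully autonomous run
    return len(actions) / human
```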
Pete: How does the quality story change for data pipelines and infrastructure, where failure modes can be silent?
Eno: The analogy I’d draw is what happened when we went from on-prem to cloud — not just a modernization of the pipeline, but an upgrade in the rigor by which code is actually maintained and built. We’re seeing something similar as organizations move from cloud-based to AI-enabled. There’s now a real reason to introduce that rigor — autonomous systems are working on your code, and the stakes of getting it wrong are different.
Data is actually one of our top use cases. When you’ve hooked up your data to Droid, you can ask it any question about any aspect of your business and get answers. Building an ETL pipeline or a data transformation on top of that is just a matter of saying: I did this one-off with Droid, now I want it institutionalized.
The only way to get around failure modes is to introduce traditional backend guardrails. How do you know a table you’re creating isn’t going to be extremely costly because it joins two massive tables? How do you know there’s not a weird union that 10x’s your BigQuery bill? Rarely have data teams gone to that level of rigor unless they’re a very large company. Our product will literally warn you: we’ve noticed there are no query linters, no guardrails — be careful, because you’re using AI in an environment that doesn’t yet have the infrastructure necessary to let autonomous agents run.
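A guardrail like the one Eno mentions could be as simple as a pre-execution lint on planned joins; the row-count budget and the heuristic below are illustrative assumptions, not Factory's product.

```python
def lint_join(left_rows: int, right_rows: int, keys_unique: bool,
              max_output_rows: int = 10**9) -> list[str]:
    """Return warnings for a planned join; an empty list means it looks safe."""
    warnings = []
    # Joining on non-unique keys can fan out toward a cross product in the
    # worst case -- the kind of join that silently multiplies a query bill.
    worst_case = (left_rows * right_rows if not keys_unique
                  else max(left_rows, right_rows))
    if worst_case > max_output_rows:
        warnings.append(
            f"worst-case output ~{worst_case:,} rows exceeds the "
            f"{max_output_rows:,}-row budget; filter or pre-aggregate first"
        )
    return warnings
```

An agent (or a human) would run this before executing the query, turning a silent cost blowup into a loud warning.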
Pete: Thanks, Eno! Looking forward to continuing the conversation at AI Council.
Eno: Same — talk then!
Watch the full conversation here:
There’s a lot more to dig into on AI coding. A few questions still on my mind for AI Council:
How do you maintain engineering standards and culture when most of the code is being written by agents?
What does good look like for a team that’s successfully scaled agents? What KPIs should we be tracking to know?
Are the best teams slowing down to build the right infrastructure first — or are they just moving fast and cleaning up later?
Do traditional design patterns matter more or less when agents are writing the code — and does the answer change depending on whether you’re optimizing for human comprehension or agent reliability?
Also worth reading on this topic: Addy Osmani, engineering lead at Google Chrome, on “Comprehension Debt” — the hidden cost of AI-generated code.
I’m excited to hear from Benn Stancil, Co-founder of Mode; Jason Ganz, Director of DX + AI at dbt Labs; Emilie Schario, COO & VP of Eng at Kilo Code; Calvin French-Owen, Co-founder & CEO, formerly of Segment; Wes McKinney, Principal Architect at Posit; and others at AI Council, May 12–14, 2026. Grab your ticket here.
If you have another meaty question about AI coding — drop it in the comments.
