Coding Got Cheap. Verification Did Not.


At A Glance
Why read: Read this if your team is using LLM coding tools and is discovering that faster code generation does not automatically reduce review friction or speed up delivery.
Who it's for: Especially useful for tech leads, staff engineers, and platform teams trying to scale delivery while review, integration, and trust remain stubbornly human bottlenecks.
What you'll learn: Why the real constraint is now verification, why agents still do not carry risk ownership or real agency, and why smaller PRs, merge queues, and stronger guarantees matter more than raw output.
Takeaways: Write throughput is rising faster than verification throughput · Agents can generate code, but they do not own production risk · The practical answer is smaller PRs, better guarantees, stronger automated checks, and disciplined integration

Right now, the loudest claim around LLM coding tools is that coding is becoming a commodity.

I think that is directionally right. What I do not think follows automatically is the part people usually jump to next: that software delivery will therefore speed up by the same factor. The more I use these tools, the less convinced I am by that leap.

Yes, they can write routine code quickly; they can refactor at a pace that would have felt absurd not long ago. But one friction point keeps getting sharper every time:

We have increased write throughput.

We have not increased verification throughput at the same rate.

That is the part I think many teams are about to feel much more acutely: review friction.

At least, that was obvious in my own team within a week of all of us adopting LLM CLIs more seriously in our workflow. Code was appearing faster. Refactors were cheaper. Experiments were easier to try. But the moment those changes started piling up, the real constraint showed itself again: someone still had to understand them, review them, and decide whether they were safe to merge.

And while this is easiest to see with LLM CLIs and all the current code-vibing enthusiasm, I do think the point extends to agents too.

Agents do not have the agency they would need to make software delivery scale in a production environment.

They can generate code. They can propose plans. They can widen the search space. But they do not own production risk. They do not carry on-call duty. They do not defend the change in front of a customer. They do not absorb the cost of being wrong.

That responsibility is still human.

And because that responsibility is still human, the bottleneck has moved.

Ninja engineers generating pull requests faster than a slower verification station can review, verify, and merge them.

The new imbalance is simple: code generation is accelerating faster than review and verification.

From Writing To Verification

For a while, most of the conversation around coding agents was about output:

  • how many files they can touch
  • how quickly they can scaffold
  • how much code they can produce in one go
  • whether coding itself is becoming a commodity

That is no longer enough as a way of thinking.

If code generation gets ten times faster while review, integration, and verification stay roughly flat, the system does not become ten times faster. It becomes unstable.

What used to be scarce was code production. What is scarce now is trust. And trust is slower. It lives inside:

  • review bandwidth
  • change understanding
  • test quality
  • integration sequencing
  • rollback confidence
  • the ability to explain why a change is safe

That is why I do not find “these tools make engineers faster” a very useful claim on its own. Faster at producing diffs is not the same thing as faster at delivering software. Worse, if you leave the system unchanged, the imbalance compounds:

  • more code appears
  • reviewers get overloaded
  • review quality drops
  • defects move downstream
  • rollback frequency rises
  • trust in generated changes starts to erode

So no, the bottleneck did not disappear; it moved from writing code to trusting code.

The Wrong Fix: More Agents

I think many teams are still responding to this with the wrong instinct.

If generation is cheap, they assume the answer is to introduce even more agents, even more automatic change, even more output.

But more agents do not solve a trust bottleneck; they amplify it. Without strong engineering constraints, cheap generation gives you:

  • bigger pull requests because exploration is cheap
  • noisier pull requests because changing code is cheap
  • more speculative diffs because rewriting is cheap
  • slower reviews because understanding still costs the same

That is not scale; it is faster chaos. If teams do not build a stronger trust system around these tools, they will not really scale AI-assisted development. They will just generate more change than they can responsibly absorb.

The Better Framing: Verification Systems Design

This is why I think the right framing is not “how do we optimise the PR process?”

but:

How do we design a verification system that can keep up with generated change?

Smaller PRs matter. Merge queues matter. I believe that strongly. But they are not enough on their own. They improve the shape of change. They do not automatically make change trustworthy.

If you want AI-assisted development to scale, you need a system that turns fast code generation into verifiable, reviewable, bounded progress. That means moving from reviewing code to reviewing guarantees.

A verification system is not just a pile of checks. It is a structured way of turning change into bounded, testable, explainable units of risk.

Review Guarantees, Not Just Diffs

Right now, too many AI-assisted workflows still look like this:

Tool writes code
Human reviews diff
Human approves
Hope nothing subtle broke

That does not scale; it just shifts cognitive load onto the reviewer.

The better pattern is to require every serious change to state clearly:

  • what changed;
  • what must remain true;
  • how we know it works;
  • what failure modes were considered.

If that information is missing, the reviewer is being asked to reconstruct intent from the diff, infer risk from context, and simulate behaviour in their head.

That is expensive, and that is exactly the kind of review friction we should be trying to remove. The important part is to make those guarantees tangible. For example:

  • this transformation preserves ordering invariants;
  • this refactor is behaviorally equivalent under property tests;
  • this change cannot affect downstream state transitions because the boundary remains unchanged.

Once a reviewer sees that kind of claim backed by evidence, the whole exercise changes. They stop scanning raw volume and start checking bounded risk.
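As a minimal sketch of what "backed by evidence" can mean, here is a behavioral-equivalence check run over many generated inputs. The `legacy_normalize` and `fast_normalize` functions are hypothetical stand-ins for the old and new implementations, and plain `random` stands in for a full property-testing library:

```python
import random

def legacy_normalize(items):
    # Old implementation: sort, then drop adjacent duplicates.
    out = []
    for x in sorted(items):
        if not out or out[-1] != x:
            out.append(x)
    return out

def fast_normalize(items):
    # Refactored implementation under review.
    return sorted(set(items))

# The guarantee, made executable: the refactor is behaviorally
# equivalent to the original across many generated inputs.
random.seed(0)
for _ in range(1000):
    items = [random.randint(-50, 50) for _ in range(random.randint(0, 30))]
    assert fast_normalize(items) == legacy_normalize(items)
print("equivalence held on 1000 generated inputs")
```

The reviewer's question shrinks from "did you read every line of the new implementation?" to "is this property the right one, and did it hold?"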

Ninja engineers reviewing guarantees, invariants, tests, and failure modes instead of just scanning raw diffs.

A better review model is not “read more diff.” It is “check stronger guarantees.”

Back To Fundamentals

This is the part I find slightly amusing. Once you follow the argument through, the answer starts sounding strangely old-fashioned. If review friction is the bottleneck, then we do not get out of it with more theatrical tooling.

We get out of it by returning to fundamentals:

  • smaller PRs;
  • clearer intent;
  • narrower scope;
  • better tests;
  • merge queues, and;
  • easier rollback.

That is not because these are fashionable process ideas. It is because they reduce the cost of review and verification.

Large PRs force reviewers into archaeology. They have to reverse-engineer intent, infer boundaries, and simulate outcomes in their head.

Small PRs let them ask a much narrower question:

Is this one change understandable, bounded, and safe to merge?

That is a real throughput advantage.

In an agent-assisted workflow, this matters even more. The natural temptation is to let the tool range widely and submit one impressive diff. That is exactly the wrong shape of change if trust is the bottleneck.

So yes, smaller PRs, stacked changes, narrow intent, and one decision per review unit become a must. They are no longer simply hygiene; they are part of the verification system.

This is also where a simple test-driven instinct helps a lot. For example, if someone wants to do a refactor, one very clean pattern is:

  1. first PR: add tests and increase coverage
  2. second PR: do the refactor

The separation matters.

  • In the first PR, the intent is obvious: we are improving confidence.
  • In the second PR, the tests stay fixed, which makes the claim much narrower: behaviour should stay the same.

That lowers cognitive load immediately.

The same principle generalises. If a change is behavioural, keep the scope small. If a feature is large, deliver it in steps. The hardest work is usually restructuring, and that is exactly where thinking hard about incremental delivery matters most.
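The two-PR refactor pattern might start like this: a sketch of the first, tests-only PR that pins current behaviour before anything moves. The `slugify` function here is a hypothetical refactor target:

```python
import re

def slugify(title):
    # Current implementation, slated for restructuring in PR 2.
    s = title.strip().lower()
    s = re.sub(r"[^a-z0-9]+", "-", s)
    return s.strip("-")

# PR 1: characterization tests that document behaviour as-is,
# including the edge cases.
assert slugify("Hello, World!") == "hello-world"
assert slugify("  Already--slugged  ") == "already-slugged"
assert slugify("") == ""
# PR 2 then changes the implementation while these tests stay fixed,
# narrowing the review question to: did behaviour stay the same?
```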

If you want something practical to adapt for your own team, I put together a reusable reference here:

Force Decomposition At Generation Time

This is where I would push the workflow harder. Do not wait until review time to discover that the diff is too large. Force decomposition earlier.

The correct shape is:

Task
Plan
Substeps
PR sequence

Not:

Task
Giant AI diff
Panic review

This is one of the most useful things these tools can do, by the way. They should not just write code. They should help propose the incremental delivery plan by which the code can be introduced safely.

That is a much better use of an agent than simply asking it for more implementation.
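One hedged sketch of what that plan shape could look like as data. The `Plan` and `Step` names are illustrative, not an existing tool, and the task itself is invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    title: str        # one decision per review unit
    intent: str       # why this change exists
    invariants: list  # what must remain true
    validation: str   # how correctness will be checked

@dataclass
class Plan:
    task: str
    steps: list = field(default_factory=list)

plan = Plan(
    task="Replace ad-hoc retry logic with a shared backoff helper",
    steps=[
        Step("Add characterization tests around current retry paths",
             "raise confidence before touching behaviour",
             ["existing retry semantics are pinned"],
             "new tests pass against current code"),
        Step("Introduce backoff helper behind the existing interface",
             "new code path, no callers yet",
             ["no caller behaviour changes"],
             "unit tests on the helper"),
        Step("Migrate callers one module at a time",
             "small, reviewable diffs",
             ["tests from step 1 stay green"],
             "CI plus the pinned characterization tests"),
    ],
)
# Each step becomes one PR in the merge queue.
print(len(plan.steps), "PRs instead of one giant diff")
```

Asking the tool to emit this structure first, and only then implement step one, is the decomposition-at-generation-time discipline in practice.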

Ninja engineers breaking a large feature into small pull requests that move through CI, checks, review, and merge in an orderly queue.

Small PRs are not tidiness theatre. They are one of the cleanest ways to lower review friction.

Shift Validation Left Into Machines

If humans remain the primary validators of AI-generated code, I do not think the model scales very far.

Humans should still own risk. But they should not be forced to simulate execution in their head for every meaningful change. That means stronger machine-side verification.

1. Property-based testing

I think property-based testing is one of the most underused tools here.

Why?

Well, because many AI-generated bugs are not obvious syntax bugs. They are edge-case bugs. Boundary bugs. “This looked correct for three examples and broke on the fourth” bugs.

Property-based testing helps because it checks invariants across many generated inputs instead of blessing one or two happy-path examples.

A few practical cases (skip these if you get the point):

  • a parser should round-trip valid inputs without losing structure
  • a serialization layer should preserve data after encode/decode
  • a ranking function should preserve ordering invariants you care about
  • a pricing or allocation function should never produce negative totals or violate conservation constraints
  • a stream transformation should preserve event counts when it is not supposed to drop or duplicate events
  • an aggregate that should only grow as more events arrive should remain monotonic
  • a pipeline that depends on arrival order should preserve event ordering where that contract is supposed to hold

That matters because it turns “I read the diff and it seemed fine” into “the core property stayed true under many cases.”

That is a better verification signal.
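As one concrete sketch, here is the serialization round-trip property from the list above, checked over many generated records. Libraries like Hypothesis automate and shrink this kind of input generation; plain `random` is used here only to keep the sketch dependency-free:

```python
import json
import random
import string

random.seed(0)

def random_record():
    # Generate a small dict with assorted value types.
    return {
        "".join(random.choices(string.ascii_lowercase, k=5)):
            random.choice([random.randint(-10**6, 10**6),
                           random.random(),
                           "".join(random.choices(string.printable, k=8)),
                           None, True])
        for _ in range(random.randint(0, 10))
    }

# Property: the serialization layer preserves data after encode/decode.
for _ in range(500):
    record = random_record()
    assert json.loads(json.dumps(record)) == record
print("round-trip property held on 500 generated records")
```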

2. Static analysis gates

Static analysis is another place where teams should be more aggressive.

Not static analysis theatre. Not one more badge in CI. Real gates.

Practical examples:

  • type errors should fail fast;
  • nullability violations should fail fast;
  • unsafe imports or forbidden dependencies should fail fast;
  • obvious dead code or unhandled branches should fail fast, and;
  • insecure patterns or dangerous API usage should fail fast.

The more routine structural mistakes a machine can reject automatically, the less human energy gets wasted on basic hygiene.

That leaves humans freer to review the part that actually matters: design, guarantees, and risk.
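A minimal sketch of a fail-fast gate runner. The `mypy` and `ruff` commands in the comment are assumptions about your stack; swap in whatever analyzers your team actually runs:

```python
import subprocess
import sys

def gate(checks):
    """Run each named check; return the first failing name, or None."""
    for name, cmd in checks:
        if subprocess.run(cmd).returncode != 0:
            return name  # fail fast: later checks are not even started
    return None

# In CI this might look like:
#   gate([("types", ["mypy", "--strict", "src/"]),
#         ("lint",  ["ruff", "check", "src/"])])
# Demonstrated here with stand-in commands so the sketch is runnable:
failed = gate([
    ("ok",   [sys.executable, "-c", "pass"]),
    ("boom", [sys.executable, "-c", "raise SystemExit(1)"]),
])
print("first failure:", failed)  # first failure: boom
```

The point is the shape, not the tool list: a real gate stops the pipeline, rather than decorating it with one more badge.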

3. Runtime assertions

I am much less enthusiastic about runtime assertions than about tests, validation, or stronger system boundaries.

Most of the time, if you need an assertion, it is worth asking whether the system should have prevented that state earlier through better design, clearer contracts, or stricter validation.

In other words, I would not treat assertions as a primary verification strategy.

They still have a narrow place, though, around internal invariants that should be impossible if the rest of the system is behaving correctly. For example:

  • a state machine reaches an illegal transition;
  • two mutually exclusive internal flags are both true;
  • an event-ordering assumption inside one component is suddenly broken, and;
  • an internal contract is violated in a way that risks silent corruption.

That is where a loud failure can be better than quietly propagating bad state.
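A small sketch of that narrow use, for the illegal-transition case. The job states and transition table are hypothetical; the point is that an impossible transition fails loudly instead of silently corrupting state:

```python
# Legal transitions for a hypothetical job state machine.
LEGAL = {
    "queued":  {"running"},
    "running": {"succeeded", "failed"},
    "failed":  {"queued"},  # retry path
}

class Job:
    def __init__(self):
        self.state = "queued"

    def transition(self, new_state):
        # Internal invariant: only table-listed transitions may occur.
        # If this fires, the bug is elsewhere; fail loudly, not quietly.
        assert new_state in LEGAL.get(self.state, set()), \
            f"illegal transition {self.state} -> {new_state}"
        self.state = new_state

job = Job()
job.transition("running")
job.transition("succeeded")
print(job.state)  # succeeded
```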

So ok, assertions can help, but only as a last line of defence. I would much rather prevent bad states than merely notice them at runtime.

Add Risk Awareness To Review

Another thing I think teams need is a more explicit notion of change risk.

Not every AI-generated change should go through the same review path.

There is a difference between:

  • a local refactor;
  • a business-logic change;
  • a concurrency change;
  • a stateful systems change, or;
  • a distributed recovery or integration change.

Those should not all be treated as the same kind of review object.

What I would want is some form of confidence or risk scoring:

  • 🟢 low-risk cosmetic or local changes get a lighter path
  • 🟠 medium-risk logic changes get stronger automated evidence
  • 🔴 high-risk stateful or distributed changes get narrower scope and deeper human scrutiny
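A crude sketch of path-based risk tiering. The directory patterns and tiers are illustrative; a real scorer would also weigh diff size, call-graph reach, and test coverage:

```python
# Hypothetical path patterns mapping changed files to risk tiers.
HIGH_RISK = ("migrations/", "concurrency/", "replication/")
MEDIUM_RISK = ("services/", "billing/")

def risk_tier(changed_paths):
    if any(p.startswith(HIGH_RISK) for p in changed_paths):
        return "high"    # narrower scope, deeper human scrutiny
    if any(p.startswith(MEDIUM_RISK) for p in changed_paths):
        return "medium"  # stronger automated evidence required
    return "low"         # lighter review path

print(risk_tier(["docs/README.md"]))                 # low
print(risk_tier(["services/billing/api.py"]))        # medium
print(risk_tier(["migrations/0042_add_index.py"]))   # high
```

Even something this blunt, wired into the PR pipeline, breaks the one-size-fits-all review path.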

Right now, most teams still treat this too uniformly:

Open PR
Assign reviewer
Hope for the best

That is not mature enough for the level of change velocity these tools can produce.

Trust Is What Makes Automation Scale

If there is one broader point underneath all of this, it is that:

Automation does not scale on capability alone; it scales on trust.

If an AI system is not trustworthy, people will hesitate to adopt it, hesitate to depend on it, and ultimately refuse to give it real responsibility. That is true whether we are talking about coding tools, agents, or any other form of automation.

And trust does not appear by magic. It comes from being able to explain what the system is doing, trace why it did it, bound the risk, and verify that it is behaving safely enough to rely on.

That is why verification matters so much. A strong verification system is how an organisation turns output into trust.

The self-driving car example makes that point clear.

The problem with self-driving was never just whether people would emotionally accept the absence of a driver.

You can put a human in the driver’s seat and solve part of the problem for a while. That gives you supervision, and maybe enough trust to experiment. But it also shows the limit immediately: you still have not built enough trust into the system for automation to carry the responsibility on its own.

To unlock the real benefit, you need a validation system strong enough to make the absence of a driver trustworthy.

  • Simulation mattered.
  • Certification mattered.
  • Safety cases mattered.
  • Verification pipelines mattered.

We did not start trusting self-driving because models improved. We trusted it only to the extent that validation systems became industrial.

We do not need agents with mystical agency.

We need enough trust in their output that automation can carry more of the load without a human having to re-derive everything from scratch.

An autonomous delivery car navigating a structured path through tests, review, small PRs, and production while a ninja engineer observes from the side.

Automation starts to scale when trust is built into the delivery path itself, not when a human has to keep rescuing the system from the driver’s seat.

If I were designing for this bottleneck deliberately, I would want something closer to this:

  1. A task is decomposed into a sequence of narrow changes before major implementation begins.
  2. Each change states intent, invariants, and how correctness will be validated.
  3. Automated checks do the first line of trust work: tests, static analysis, diff classification, CI.
  4. Reviewers focus mostly on boundary decisions, guarantees, and system fit.
  5. Merge queues and rollback paths keep integration disciplined and stop trust from being wasted in merge thrash.

That is a much more serious model than “AI writes, human skims, merge and pray.”

The practical takeaway is not to resist agents. It is to build an engineering system where review and verification can keep up with them.

The real unit of speed is not how quickly code appears in a branch. It is how quickly a team can move a change from idea to trusted production without losing control of the system.

That is the metric that matters. And once you define speed that way, the answer stops sounding futuristic. It becomes strangely familiar:

  • smaller PRs;
  • clearer intent;
  • stronger guarantees;
  • better tests;
  • static analysis gates;
  • selective runtime assertions;
  • merge queues, and;
  • low-friction rollback.

These are not bureaucratic leftovers from a slower era. They are what make faster tooling usable.

If LLM tooling keeps improving, the teams that win will not be the ones that generate the most code.

They will be the ones that turn trust into a system.

If coding is becoming a commodity, verification is not.

And if agents do not have agency, the burden of trust still sits with us.

Many teams are about to discover that the next productivity battle is not about writing code at all. It is about whether their engineering system can metabolise AI-generated change without losing control.

The best prompt in the world will not save a team that cannot review, verify, and integrate change with discipline.

That is a much less theatrical advantage. It is also the real one.
