The optimal number of unreviewed PRs is not zero

Code got cheap. Review didn’t. What queueing theory says about the AI-era backlog.

John Begeman · June 2026

“The dominant paradigm for managing product development is fundamentally wrong. Not just a little wrong, but wrong to its very core.”

Donald Reinertsen, The Principles of Product Development Flow

Every engineering lead I talk to has the same complaint right now: AI made writing code fast, and now code review is the bottleneck. PRs pile up, reviewers drown, and the usual response is some mix of “we need more reviewers” and “we need AI to review the AI’s code.”

I went back to Don Reinertsen’s The Principles of Product Development Flow with this problem, and his framework rejects the framing before it does anything else. You don’t have a code review problem. You have a queue, and you’ve stopped looking at it.

Sidenote: Donald G. Reinertsen, The Principles of Product Development Flow: Second Generation Lean Product Development (Celeritas Publishing, 2009). Dense, unglamorous, and more useful per page than anything else written about how engineering organizations actually move.

The queue you can’t see

Reinertsen’s central argument is that development organizations obsess over what they can see (output, utilization, busy people) while ignoring the most expensive thing in the system: work sitting in queues. An unreviewed PR is inventory. It ages, goes stale against main, and blocks whatever is behind it. Most teams can quote their deploy frequency to two decimals and have no idea what their median PR wait time is, or whether it’s growing. I didn’t, until I checked.

Sidenote: Isn’t this just DORA? Not quite. DORA’s lead time for changes runs commit-to-production, which contains the review queue but doesn’t isolate it: a fine end-to-end alarm, useless for locating the bottleneck. The numbers you want are pickup time (open to first review), queue depth, and queue age. Deploy frequency, meanwhile, is a DORA metric, which is exactly why everyone can quote it.

Step one is boring. Instrument the queue: depth, age, arrival rate against service rate. Until then you’re managing the wrong variable.

Sidenote: Little’s Law ties these together: L = λW. The average number of items in a system equals the arrival rate times the average time each item spends there. It holds for any stable queue, no assumptions about distributions required. If PRs arrive at 9 a day and sit for 2 days, you carry 18 open PRs. Always.
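As a rough sketch of what instrumenting the queue could look like, the Python below computes depth, age, and arrival rate against service rate from a list of PR records. The `PullRequest` record, its two fields, and the function names are assumptions made for illustration; they are not any tracker's API, and you would map your own export onto them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class PullRequest:
    """Hypothetical record; map your tracker's fields onto it."""
    opened: datetime
    first_review: Optional[datetime] = None  # None means still waiting


def queue_snapshot(prs, now, window_days=7):
    """Depth and age of the waiting queue, plus arrival and service
    rates over a trailing window."""
    window_start = now - timedelta(days=window_days)
    waiting = [pr for pr in prs if pr.first_review is None]
    # Dividing two timedeltas yields a float number of days.
    ages = [(now - pr.opened) / timedelta(days=1) for pr in waiting]
    arrivals = sum(1 for pr in prs if pr.opened >= window_start)
    served = sum(
        1 for pr in prs
        if pr.first_review is not None and pr.first_review >= window_start
    )
    return {
        "depth": len(waiting),
        "median_age_days": median(ages) if ages else 0.0,
        "oldest_age_days": max(ages, default=0.0),
        "arrivals_per_day": arrivals / window_days,
        "served_per_day": served / window_days,
    }


def littles_law_depth(arrival_rate_per_day, avg_days_in_system):
    """L = lambda * W: long-run average number of items in the system."""
    return arrival_rate_per_day * avg_days_in_system
```

If `arrivals_per_day` sits above `served_per_day` week after week, the queue is growing no matter how busy everyone looks. `littles_law_depth(9, 2)` gives the 18 open PRs from the example above.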

Why it got bad so fast

Wait time explodes nonlinearly as utilization approaches 100%. That’s not a management opinion, it’s queueing theory, and it means a reviewer whose review capacity is 95% spoken for is not your most efficient reviewer. They’re the reason the queue is three days deep.

Sidenote: Utilization here means demand against review capacity (the handful of hours a week genuinely available for reviewing after someone’s own work), not 95% of their waking hours. Nobody reviews nine hours a day. That’s part of the problem: the queue is served by a far smaller pipe than the org chart suggests, so it saturates faster than anyone expects.

Figure 1. Time a PR spends in the system, in multiples of the bare review time, as reviewer utilization ρ rises. For a single-server queue with random arrivals, W = W₀ / (1 − ρ). The curve is flat for a long time, and then it isn’t. Everything interesting about your backlog happens to the right of 85%. [Plot: x-axis, reviewer utilization ρ from 0 to 100%; y-axis, time in system ÷ review time. Marked points: ρ = 70%, waits ≈ 3×; ρ = 95%, waits ≈ 20× (your “most efficient” reviewer).]
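The curve is easy to tabulate yourself. This minimal sketch evaluates W / W₀ = 1 / (1 − ρ) at a few utilization levels; the function name is illustrative, and the formula carries the same single-server, random-arrival assumption as the figure.

```python
def time_in_system_multiple(utilization):
    """W / W0 = 1 / (1 - rho) for a single-server queue with random arrivals."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1 the queue never drains")
    return 1.0 / (1.0 - utilization)


for rho in (0.50, 0.70, 0.85, 0.95, 0.99):
    multiple = time_in_system_multiple(rho)
    print(f"rho = {rho:.0%}: time in system is {multiple:.1f}x the bare review time")
```

The step from 70% to 85% roughly doubles the multiple (about 3.3× to 6.7×). The step from 85% to 95% triples it again, to 20×.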

Now add AI. The tools cranked the arrival rate into a stage whose capacity is fixed and was probably already running hot. Nobody removed a bottleneck here. We relocated one and made it worse, which is what speeding up one stage in isolation always does: the inventory just lands at the next handoff.

Little’s Law makes the damage concrete, and the numbers are worth staring at. Take a team whose reviewers can clear ten PRs a day. Before AI they received seven a day — 70% utilization, PRs waited about a third of a day, two or three open at any time. Fine. Now the same team receives nine and a half a day. Arrivals went up 36%. The queue did not go up 36%.

Figure 2. The same review team before and after AI, capacity fixed at 10 reviews/day. Arrivals rise from 7 to 9.5 per day (×1.36). Time-in-queue rises from a third of a day to two days (×6); by L = λW, PRs waiting rise from about 2 to 19 (×8). A modest change in input, an order of magnitude in the queue. This is why it felt sudden. [Chart: before AI vs. after AI against a 1× baseline. Arrival rate ×1.36; time in queue ×6; PRs in queue ×8 (2.3 → 19 open).]
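The figure's numbers can be reproduced in a few lines, assuming the review stage behaves like a textbook M/M/1 queue (one pooled server, random arrivals and service times). That model is an assumption layered on the essay's example, and the function name is illustrative. The times it returns are time in the system, meaning the wait plus the review itself.

```python
def mm1_steady_state(arrivals_per_day, capacity_per_day):
    """Steady-state figures for an M/M/1 queue (an assumed model)."""
    if arrivals_per_day >= capacity_per_day:
        raise ValueError("arrivals must stay below capacity for a stable queue")
    utilization = arrivals_per_day / capacity_per_day
    days_in_system = 1.0 / (capacity_per_day - arrivals_per_day)
    prs_in_system = arrivals_per_day * days_in_system  # Little's Law, L = lambda * W
    return {
        "utilization": utilization,
        "days_in_system": days_in_system,
        "prs_in_system": prs_in_system,
    }


before = mm1_steady_state(7.0, 10.0)
after = mm1_steady_state(9.5, 10.0)

for key in before:
    ratio = after[key] / before[key]
    print(f"{key}: {before[key]:.2f} -> {after[key]:.2f} (x{ratio:.1f})")
```

Time in the system goes from a third of a day to two days, and open PRs from about 2.3 to 19, while arrivals moved by roughly a third.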

The queue feeds itself

The static picture is bad enough, but a hot queue doesn’t stay static. A loaded review stage changes the behavior of the people feeding it, and every behavioral change makes the load worse.

Watch what authors do while a PR waits. They don’t idle; they start the next branch. Every blocked engineer fans out into more open work, all of it headed for the same queue. Waiting generates arrivals.

Meanwhile the waiting PRs age. Main moves underneath them, so they need rebases, conflict resolution, and re-review — each stale PR now consumes more reviewer capacity than it would have if it had been reviewed the day it was opened. Authors adapt too: if review takes three days regardless, why send three small PRs when you can send one big one? Batches grow, and each review gets slower. And rushed reviewers rubber-stamp, so defects merge and come back later as bug-fix PRs — arrivals you wouldn’t otherwise have had.

Figure 3. The reinforcing loop. Each mechanism on the bottom edge pushes utilization back up, which lengthens waits, which strengthens every mechanism. Past saturation the queue stops being a passive buffer and becomes an actor in its own growth, which is why teams cross from “fine” to “permanent crisis” in weeks without anyone making a bad decision. [Diagram: reinforcing loop (R). Reviewer utilization rises → PRs wait longer → authors adapt (parallel branches, bigger batches, stale rebases, rubber-stamp merges) → more arrivals, slower reviews → utilization rises again. Each pass around amplifies the next.]

All four arrows point the same direction: utilization goes up. That’s a reinforcing loop, and it’s why the degradation never feels linear from inside. Figure 1 showed a static curve; the loop is what drags you along it. A team that was fine at 80% crosses the bend, the queue starts feeding itself, and a few weeks later review feels permanently broken even though nothing about the team changed.

There’s an encouraging flip side. The loop runs in reverse, too. Pull utilization back below the bend and every arrow weakens at once: PRs stop aging, batches shrink back, parallel WIP drains, rubber-stamping subsides. The interventions below look modest on paper because paper shows the first-order effect. What you’re actually doing is switching off the amplifier.

What to actually do

Five things fall out of the framework, and “hire more reviewers” isn’t one of them.

Buy reviewer slack explicitly. “Run below full utilization” is useless as advice if nobody owns it, so make it structural. Count review as planned work: most sprint planning treats it as free, which is how reviewers end up at 95% without anyone having decided that. If reviews eat a fifth of a senior engineer’s week, that fifth goes in the plan. Some teams go further and rotate a designated reviewer whose job that week is draining the queue, staffed the way you’d staff an on-call. And watch the utilization number the way you’d watch an SLO — sustained load above ~85% is an early-warning signal (look at where Figure 1 bends), not a productivity win.

Minimize transaction costs by packaging work to minimize review time. The underlying lever is batch size, and the economics moved: the optimum sits where transaction cost meets holding cost (Figure 4), and AI just cut the authoring share of that transaction cost. What remains of it is mostly reviewer attention — so reviewer service time is the thing to minimize, not diff size. A reviewer clears work fastest when each unit asks them to make one coherent decision. Sometimes that means a smaller PR. Just as often it means splitting the same work along better seams: the mechanical rename in one PR and the behavior change in another, a stacked series where each step is obviously correct on its own, generated test scaffolding separated from the logic it exercises. A 900-line diff that’s one decision reviews faster than a 200-line diff that’s five. This matters doubly in a hot queue, because every minute shaved off service time is reviewer capacity you didn’t have to hire. Two cautions: each PR still carries fixed overhead (a CI run, a context switch), so don’t shard for sharding’s sake — and attack that overhead with faster CI and lighter ceremony for small diffs, so finer-grained packaging stays economical.

Figure 4. Total cost per change is the sum of transaction cost (falls with batch size) and holding cost (rises with it). The optimum sits at b* = √(t/h) and moves with t. AI cut the authoring share of t, dragging the optimum left, but only as far as your remaining per-PR overhead (CI runs, review ritual) allows. [Plot: x-axis, PR size; y-axis, total cost per change. Two curves: hand-written change and AI-written change.]
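For readers who want to see where b* = √(t/h) comes from, here is a short derivation. The cost model is an assumption: it is the simplest form consistent with the caption, with the transaction cost t spread across a batch of size b and a holding cost that grows linearly at rate h.

```latex
% Assumed cost model, chosen to match the caption's b* = sqrt(t/h):
%   b : batch (PR) size
%   t : transaction cost per PR (authoring, CI run, reviewer setup)
%   h : holding cost per unit of batch size
\begin{align*}
C(b) &= \frac{t}{b} + h\,b
  && \text{(cost per unit of change)} \\
\frac{dC}{db} &= -\frac{t}{b^{2}} + h = 0
  \;\Longrightarrow\; b^{*} = \sqrt{\frac{t}{h}} \\
C(b^{*}) &= 2\sqrt{t\,h}
\end{align*}
```

Because the optimum scales with the square root of t, cutting transaction cost to a quarter only halves the economical batch size. That is why the per-PR overhead that remains still matters.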

Cap work in process. A hard limit on open PRs forces the queue to drain before more work gets admitted. No forecasting required, which is most of its charm.

Tune the arrival rate. Utilization is a ratio, and everything so far has worked the capacity side. The other side is what you let into the queue, and that’s a policy choice masquerading as a law of nature. Not every change needs the same review: a risk-tiered policy — auto-merge for lockfile bumps and generated docs, a lightweight single-reviewer pass for low-blast-radius changes, full review reserved for anything touching money, auth, or data — can remove a real share of arrivals without touching the risk you actually care about. Reviewing everything identically is FIFO’s cousin: a policy chosen for its evenness, not its economics. Further upstream, the arrival rate is set by what work gets started at all, and that’s an alignment problem. If the organization hasn’t agreed on what’s valuable, every team generates marginal work in good faith and ships it to the same queue. Clear priorities are congestion control. (More on the demand side below.)
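A risk-tiered policy can start as a few lines of path matching. The sketch below is one possible shape, not a recommendation of specific rules: the path patterns, tier names, and `review_tier` function are all hypothetical, and it simplifies by treating every change outside the high-risk paths as low-blast-radius.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; every repository needs its own lists.
HIGH_RISK = ("billing/*", "auth/*", "migrations/*")
AUTO_MERGE = ("*.lock", "package-lock.json", "docs/generated/*")


def _matches(path, patterns):
    # fnmatchcase's "*" also matches "/" so "auth/*" covers nested files.
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def review_tier(changed_paths):
    """Return 'full', 'light', or 'auto-merge' for a list of changed file paths."""
    if not changed_paths:
        return "full"  # nothing to classify: fail closed
    if any(_matches(path, HIGH_RISK) for path in changed_paths):
        return "full"  # anything touching money, auth, or data
    if all(_matches(path, AUTO_MERGE) for path in changed_paths):
        return "auto-merge"  # only lockfile bumps or generated docs
    return "light"  # single-reviewer pass


print(review_tier(["yarn.lock"]))
print(review_tier(["ui/button.css", "yarn.lock"]))
print(review_tier(["auth/session.py", "yarn.lock"]))
```

The ordering is the design choice that matters: a single high-risk file promotes the whole PR to full review, so bundling a risky change with a lockfile bump cannot slip it through the cheap tier.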

Stop reviewing FIFO. Take the high-value, quick-to-clear changes first. Blind FIFO will happily leave a fix worth $10k a week in delay stuck behind a cosmetic rename, and nobody will even notice it happened.

Sidenote: Reinertsen’s name for this is Weighted Shortest Job First: sequence by cost of delay divided by time to clear. It’s the scheduling policy that maximizes value delivered through a constrained resource.
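The sort itself is the easy part. A minimal sketch, assuming each PR already carries a rough cost-of-delay estimate and a time-to-clear estimate (the `QueuedPR` fields and the sample figures are invented for illustration):

```python
from dataclasses import dataclass


@dataclass
class QueuedPR:
    title: str
    cost_of_delay_per_week: float  # rough estimate, in dollars
    hours_to_clear: float          # estimated reviewer time


def wsjf_order(queue):
    """Sort by cost of delay divided by time to clear, highest ratio first."""
    for pr in queue:
        if pr.hours_to_clear <= 0:
            raise ValueError(f"{pr.title!r}: hours_to_clear must be positive")
    return sorted(
        queue,
        key=lambda pr: pr.cost_of_delay_per_week / pr.hours_to_clear,
        reverse=True,
    )


# Listed in arrival order, which is how FIFO would serve them.
queue = [
    QueuedPR("cosmetic rename", cost_of_delay_per_week=50.0, hours_to_clear=0.5),
    QueuedPR("checkout fix", cost_of_delay_per_week=10_000.0, hours_to_clear=1.0),
    QueuedPR("dependency bump", cost_of_delay_per_week=200.0, hours_to_clear=0.25),
]

for pr in wsjf_order(queue):
    print(pr.title)
```

With these sample numbers the checkout fix jumps to the front, the quick dependency bump goes second, and the rename waits.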

The tooling gap

That last lever has a problem, and it’s worth being honest about it: try to implement it tomorrow. Your review queue is presented sorted by recency. The signals attached to each PR are all mechanical — diff size, CI status, mergeability. Nothing anywhere encodes what the change is worth. The PR that unblocks next quarter’s launch and the routine dependency bump render as identical rows, so people fall back to FIFO, or worse, to whatever’s easiest to clear. The tools made cost visible and left value invisible, which is Reinertsen’s complaint fossilized into software.

And it’s not only review. The same prioritization decision recurs across the whole SDLC (which work gets built at all, which changes get the heavyweight test pass, which fixes ship first), and every one of those queues is sequenced by arrival order or by gut, because expected value lives in someone’s head instead of in the system. This is the gap I’d most like to see tooling close.

Sidenote: When Reinertsen asks members of the same team to estimate the cost of delay for the same project, the answers routinely spread by a factor of 50. His point isn’t that estimates are good; it’s that unexamined intuition is far worse.

The irony is that the technology that created the overload is well suited to easing it. Estimating expected value is a judgment call — what does this change unblock, what does it risk, what is it worth a week sooner — and judgment at scale is precisely what just got cheap. A triage layer that attaches even a rough cost-of-delay and time-to-clear estimate to each PR, and sorts the queue by their ratio, doesn’t need to be accurate to be useful. The benchmark it has to beat is FIFO. Crude beats blind.

You probably need more queues

There’s a structural move sitting above all the levers: change how many queues you have, and who feeds them.

Queueing theory seems, at first, to argue against this. The famous result says a single pooled queue feeding many servers beats separate lines; it’s why supermarkets went to one line for all registers. But code review violates the proviso completely. Review service time is dominated by context: the reviewer who owns the subsystem clears a change in twenty minutes; one who’s never touched it takes half a day and catches less. Pool reviews across a wide org and you guarantee that more changes land on cold reviewers. The pooling math runs backwards.

Sidenote: Pooling wins provided jobs and servers are interchangeable: any cashier, any cart. That proviso does all the work in what follows.

So the structure that minimizes transaction cost is many small queues, each owned by a small, focused team with deep context over its own surface. Inside a team like that, review is cheap — shared context collapses the setup time of every review. And recall what Figure 4 says falls out of a smaller t: smaller economical batches, and faster flow. Team boundaries are a transaction-cost technology.

Figure 5. Pairwise coordination channels, n(n−1)/2, against contributors to a shared codebase. Ten contributors share 45 channels; twenty share 190. Doubling the people roughly quadruples the coordination surface: merge conflicts, churned context, review load, CI contention all scale with the pairs, not the people. [Plot: x-axis, contributors to one codebase (10 to 30); y-axis, coordination channels. Marked points: 10 people, 45 pairs; 20 people, 190 pairs.]

There’s a second nonlinearity stacked on top. A common codebase is itself a shared, congested resource. Every additional contributor to the same surface imposes costs on all the others (merge conflicts, churned context, review load, CI contention), and those costs grow with the number of pairs, not the number of people. Twice the contributors to one codebase isn’t twice the friction; it’s closer to four times. AI makes this acute, because agents are contributors too, added by the dozen, and usually pointed at the same shared code.

Sidenote: Fred Brooks, The Mythical Man-Month (1975). Communication paths grow as n(n−1)/2, which is why adding people to a late project makes it later. Fifty years old and newly urgent, now that the marginal “person” might be an agent and you can add twenty of them in an afternoon.
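The arithmetic is worth running once with agents in the head count. A minimal sketch (the function name is illustrative, and the ten-humans-plus-twenty-agents scenario is a made-up example):

```python
def coordination_channels(contributors):
    """Pairwise channels among n contributors: n * (n - 1) / 2."""
    if contributors < 0:
        raise ValueError("contributors must be non-negative")
    return contributors * (contributors - 1) // 2


humans = 10
agents = 20  # hypothetical: agents added to the same shared codebase

print(coordination_channels(humans))            # 45
print(coordination_channels(2 * humans))        # 190
print(coordination_channels(humans + agents))   # 435
```

Ten people pointed at one codebase share 45 pairs. The same ten plus twenty agents share 435, nearly ten times the coordination surface for three times the contributors.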

The fix is boundaries, which is to say architecture. Carve the codebase along ownership lines so that most changes start and finish inside one team’s surface, one team’s queue, one team’s context. Reinertsen makes the same argument about shared specialized resources in general: centralizing them looks efficient on a spreadsheet and generates queues in practice. Cross-team changes don’t disappear, but they become the priced exception instead of the ambient condition. You ship your org chart either way; the move is to draw an org chart made of small teams with real boundaries, so that the flow you ship is the flow you chose.

When the right batch is enormous

There’s an objection to all this small-batch discipline worth taking seriously, put well in a recent apenwarr essay: some changes don’t decompose. A new product whose components only have value together. An architecture that’s wrong and needs replacing, not sanding. Apply small-PR ritual there and you aren’t managing risk, you’re trapping yourself in a local optimum, while AI stands by, perfectly happy to make changes as big and interconnected as you can prompt.

Sidenote: apenwarr, on annealing (May 2026). Software development as simulated annealing: start with big, high-energy jumps through the solution space, lower the temperature as the system matures. The line that sticks: “not every big step is made of small steps.”

This looks like a contradiction of Figure 4. It’s actually the same equation in a different regime. The optimum balances per-batch overhead against the cost of carrying a big unvalidated change — and both terms are wildly different for a young system. A product with no users has almost no carrying cost: a 4,000-line jump breaks nobody, strands no migrations. Meanwhile the cost of delay on learning is enormous — every week spent slicing the jump into politely reviewable steps is a week not finding out whether the whole direction works, while the market window slides. Price those honestly and the optimal batch is huge. A big chaotic jump is simply what a very high cost of delay looks like once you’ve priced it. For a mature product the same terms reverse — modest upside per change, catastrophic downside across millions of users — and the optimum collapses toward small. The annealing temperature isn’t a rival theory of development; it’s the location of Figure 4’s minimum, moving as the product ages.

Reinertsen even supplies the underlying principle: variability has positive economic value when payoffs are asymmetric. Early on, a big jump’s upside is unbounded and its downside is floored — worst case, you discard code that was nearly free to generate. Buy variance. At maturity the asymmetry flips, and you sell it. What reads as a culture clash between “founder mode” and “process discipline” is mostly two parameter settings of one economic function, each correct somewhere.

Two cautions survive the regime change. Even when the decision must be big, the review units often needn’t be: a coherent stacked series can carry a large jump without forcing one 4,000-line diff or a dozen context-switching fragments — that dichotomy is a tooling deficiency, the same transaction cost this essay keeps trying to shrink. And the trap runs in both directions: an org annealed into small-step ritual will be structurally unable to make the jump when the architecture really is wrong. Knowing which regime you’re in is the actual leadership job, and it’s a cost-of-delay estimate like everything else here.

The part nobody wants to hear

The deepest point is about demand. When code is nearly free to generate, you will generate marginal-value code, simply because you can. The eleven cosmetic refactors clogging your queue exist because producing them cost nothing. Owning them costs plenty: review load, merge risk, test surface, maintenance, none of which shows up at generation time. AI didn’t make code cheap. It made one step cheap and shoved the rest of the bill into the parts of the system you weren’t measuring.

The disciplined conclusion is to produce less. I laughed when I first wrote that down, because I know exactly how it lands in a planning meeting. So don’t sell it that way. Sell it as: stop paying to review things you’ll regret merging. Put a cost-of-delay number on work and killing the marginal stuff stops sounding like a slowdown. It’s how the $200k feature gets out from behind the refactor pile. A CFO buys that story.

Did you overhire?

It’s the question sitting underneath all of this, usually asked quietly. If the team consistently produces PRs that aren’t valuable — and they’re eating scarce review time — does that mean the team is too big?

Maybe. But look at what’s generating the marginal work before reaching for headcount. A flood of low-value PRs is usually a system output, not a people surplus. Production is nearly free, priorities are vague, and the organization measures visible output — so idle capacity converts itself into marginal PRs, in good faith, because shipping something is how people prove they’re working. The junk in your review queue measures the absence of a value filter more than it measures excess people. Cut a fifth of the team without changing any of that, and the remaining four-fifths will keep producing marginal work, just more slowly.

The deeper mistake is the target the question implies: that a right-sized team is one where everyone is fully loaded with valuable production work at all times. Look at Figure 1 one more time. A system where every person is 100% booked isn’t an efficient system, it’s a slow one — wait times explode, and the valuable work is precisely what sits. Hospitals don’t size emergency rooms so the beds are always full; the entire point of the capacity is to absorb variation in arrivals. Engineering works the same way. When an engineer has nothing valuable to build this week, the response the system needs is not a manufactured PR, and it isn’t a layoff either — it’s capacity standing ready, or better, moving to the constraint. The person with nothing worth authoring should be draining the review queue, not feeding it.

What actually separates useful slack from dead weight is whether capacity can move. And that’s mostly a property of the work, not the people. When work arrives as large entangled lumps, every engineer is welded to their lump, idle capacity is stranded where it stands, and the only levers you have are hire and fire. Break the work down into small, well-packaged units with legible value — the same packaging the levers above already demanded — and capacity becomes fungible. The same engineer can author this week and drain the review queue next. The constraint can be staffed when it moves, and it will move. And “is this worth doing” becomes a question you ask of each unit of work rather than of each person, which is a far better question. You end up managing the flow instead of the individuals. AI changed the production function under every team that hired for the old one; clear breakdown is what lets you rebalance instead of guess.

And once all of that is true — priorities priced, work legible, capacity mobile — the question finally becomes answerable. If people still can’t find valuable work to do or a constraint to relieve, then yes: at the current cost of production, the team is bigger than its value pipeline supports. That’s a real conclusion. But notice that it arrives last, not first, and as often as not it indicts the pipeline rather than the engineers.

One caveat to leave you with. The right size for your review queue is not zero; zero means you over-bought reviewer capacity. The right size is where the cost of more capacity equals the cost of delay, and you won’t find that point by feel. You find it by measuring. Which takes us back to step one: go look at the queue.
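To see why the economic optimum leaves a queue standing, here is one way to work it through. The model is an assumption and not something the essay specifies: the review stage is treated as an M/M/1 queue, capacity and delay are both priced linearly, and the symbols c_μ and c_d are introduced only for this sketch.

```latex
% Assumed model (not from the essay): M/M/1 review stage, linear costs.
%   \lambda : PR arrival rate (per day)
%   \mu     : review capacity (per day), the quantity being bought
%   c_\mu   : cost of one unit of review capacity per day
%   c_d     : cost of delay per PR per day spent in the system
\begin{align*}
L(\mu) &= \frac{\lambda}{\mu - \lambda}
  && \text{(PRs in the system, } \mu > \lambda\text{)} \\
C(\mu) &= c_\mu\,\mu + c_d\,L(\mu)
  && \text{(capacity cost plus delay cost, per day)} \\
\frac{dC}{d\mu} &= c_\mu - \frac{c_d\,\lambda}{(\mu - \lambda)^{2}} = 0
  \;\Longrightarrow\; \mu^{*} = \lambda + \sqrt{\frac{c_d\,\lambda}{c_\mu}} \\
L^{*} &= \frac{\lambda}{\mu^{*} - \lambda} = \sqrt{\frac{c_\mu\,\lambda}{c_d}} \;>\; 0
\end{align*}
```

Under these assumptions the cost-minimizing queue L* is strictly positive whenever capacity costs anything. It shrinks as delay gets more expensive and grows as reviewers do. Driving it to zero would take unbounded capacity.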