May 23, 2026

Measuring the Agentic SDLC

Often, the conversation around AI in software delivery has narrowed into a single question: can we ship code faster? And because that’s the question everyone is asking, it’s the question most teams try to answer when they set up their metrics. Lines of code going out. Tickets closed. Deployment frequency.

None of those numbers tell you whether the rollout is actually working.

The reason is that they’re measuring output, not outcomes. Output is easy to measure because it’s visible. Outcomes require you to have defined what you were trying to achieve before you started, and many organizations skip that part entirely.

The Funnel Starts Above the IDE

The first mistake in most agentic SDLC rollouts is treating it as an engineering problem. Hand the dev team better tools, tell them to use AI in their workflow, and track the output. The engineers go faster, or they don’t, and you’ve got your answer.

That framing misses the most important variable in the system, which is the quality of the work coming into the team in the first place.

I’ve written before about what I see as the real bottleneck in most engineering orgs. It isn’t coding speed. It’s that the work arriving at the team lacks basic context about what it’s for and what success looks like. That was true before AI and it’s still true with it. No amount of model capability translates a poorly defined problem into a good solution faster.

Here’s a useful test. Take a story out of your current backlog and ask: could someone build this correctly using only what’s in that ticket, the codebase, and your documented engineering standards? Not someone with years of institutional context baked in. Someone capable, but starting cold. If the answer is no, if completing the work requires reaching for knowledge that isn’t written down anywhere, then you haven’t defined the work well enough to hand it to an agent in any meaningful autonomous way. You’ve just handed the problem to a developer along with a more powerful typing tool.

The funnel starts at product definition. What do feature requests look like before they become stories? What do the stories contain? Are the acceptance criteria objective or interpretive? The returns from autonomous code generation scale directly with the quality of the input it’s working from. Fix the input first.

Define the Measurements First

The next step is to define what success looks like before you start measuring anything.

It may sound pretty obvious, but unfortunately it’s not consistently practiced.

“We’re shipping more code” is not a success definition. Neither is “the team is moving faster.” Those describe activity. A KPI is specific: if this metric, or this group of metrics, looks like this, then we can assume the rollout is working. If it looks like that, it isn’t. You need to know what you’re looking for before the data starts coming in, because humans interpret data, and without defined context for what the metrics actually mean in your specific rollout, that interpretation will drift toward whatever story is most convenient. Define upfront what the data you’re measuring with means in your context.

There’s a range of options, each reflecting a different level of rigor about what you’re actually trying to accomplish.

At the loosest end: features get built and they technically work. That’s a floor.

Probably not what you’re aiming for, but worth being honest if that’s what you’re willing to accept as validation.

Tighter: fewer than X percent of shipped features have a bug reported by a client within the first 15 days.

Now you’ve tied velocity to quality, and the source of the report matters. A bug caught by internal QA before release is a different signal than one a client finds in production. Both are worth tracking, but conflating them obscures what your process is actually catching and what it’s missing. A feature that ships fast and breaks in a client’s hands hasn’t helped you.
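To make that definition concrete, here is a minimal sketch in Python. It assumes the KPI is read as "the share of shipped features with at least one client-reported bug inside the window." The `BugReport` record, the `"client"` and `"internal_qa"` labels, the sample data, and the 30 percent target are all hypothetical stand-ins for whatever your tracker actually records.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class BugReport:
    feature_id: str
    reported_on: date
    source: str  # hypothetical labels: "client" or "internal_qa"


def client_bug_rate(ship_dates, bugs, window_days=15):
    """Share of shipped features with at least one client-reported bug
    within window_days of shipping."""
    window = timedelta(days=window_days)
    affected = {
        b.feature_id
        for b in bugs
        if b.source == "client"
        and b.feature_id in ship_dates
        and timedelta(0) <= b.reported_on - ship_dates[b.feature_id] <= window
    }
    return len(affected) / len(ship_dates) if ship_dates else 0.0


# Hypothetical sample data.
ship_dates = {
    "F-1": date(2026, 5, 1),
    "F-2": date(2026, 5, 4),
    "F-3": date(2026, 5, 6),
    "F-4": date(2026, 5, 8),
}
bugs = [
    BugReport("F-1", date(2026, 5, 10), "client"),      # day 9: counts
    BugReport("F-2", date(2026, 5, 5), "internal_qa"),  # internal: tracked separately
    BugReport("F-3", date(2026, 6, 1), "client"),       # day 26: outside the window
]

TARGET = 0.30  # hypothetical threshold, agreed before the rollout
rate = client_bug_rate(ship_dates, bugs)
meets_target = rate < TARGET
print(f"client-reported bug rate: {rate:.0%}, meets target: {meets_target}")
```

The point of the sketch is that the window, the source filter, and the threshold are all explicit parameters. Changing any one of them changes the answer, which is why they need to be fixed before the data comes in.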

Which definition fits depends on what you’re actually trying to accomplish.

Output is easy to celebrate. Outcomes require you to have known what you were looking for.

The Metrics Trap

Even when teams attempt to measure, the most common failure mode is picking metrics that are directionally related to performance but structurally incomplete.

Lines of code going up is a signal of activity, but it’s one of the harder metrics to judge in an agentic SDLC context, and here’s why. A skilled human engineer might use metaprogramming to ship an entire feature without adding significant lines of code, because the point of that approach was always to make the code more efficient and maintainable. An AI agent largely doesn’t get used that way. It stamps out code because token count isn’t a concern for it. So more lines doesn’t mean better, and fewer lines doesn’t mean worse.

That said, if errors are also climbing alongside output, the question isn’t whether errors went up in absolute terms. Ship more code and you’ll probably see more errors. What matters is whether the error rate as a percentage of total output is holding steady, improving, or getting worse. Five errors for every five thousand lines of code is the same rate as ten errors for every ten thousand. If that rate stays consistent as output scales, you’re at least not losing ground.
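The arithmetic is simple enough to sketch. This is a minimal illustration, assuming errors are normalized per thousand lines of code; the function name and the third data point are hypothetical.

```python
def errors_per_kloc(errors: int, lines_of_code: int) -> float:
    """Errors per thousand lines of code shipped."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return errors * 1000 / lines_of_code


before = errors_per_kloc(5, 5_000)     # five errors in five thousand lines
after = errors_per_kloc(10, 10_000)    # ten errors in ten thousand lines
slipping = errors_per_kloc(18, 10_000)  # hypothetical: output doubled, errors more than tripled

print(before, after, slipping)
```

The first two values are identical even though the absolute error count doubled. The third is the case to watch for: output scaled, and the rate got worse with it.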

But even that framing misses the most important variable, which is where the engineering time is actually going.

Consider this: you ship code faster, so deployment frequency goes up. But your error rate goes up too. Now consider that the error rate climbs high enough that your team is spending more time fixing problems than shipping new features. To understand whether that’s actually happening, you need to look at cycle time per error: how long it takes from the moment an error is identified to the moment it’s resolved and the fix is back in production. Then compare that total time investment across all errors against the time going toward net new features. That ratio is what tells you whether your team’s increased output is going toward outcomes the business actually wanted, or toward cleaning up problems introduced after the rollout.
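A rough sketch of that calculation follows. It assumes you can pull an "identified" and a "back in production" timestamp for each error, and it treats elapsed time as a proxy for engineering effort, which is a simplification. The timestamps and the `feature_hours` figure are hypothetical.

```python
from datetime import datetime

# Hypothetical records: (identified, resolved and back in production).
errors = [
    (datetime(2026, 5, 4, 9), datetime(2026, 5, 4, 17)),    # 8 hours
    (datetime(2026, 5, 6, 10), datetime(2026, 5, 7, 10)),   # 24 hours
    (datetime(2026, 5, 11, 8), datetime(2026, 5, 11, 12)),  # 4 hours
]
feature_hours = 90.0  # hypothetical time spent on net new features in the same period

# Cycle time per error, in hours.
cycle_hours = [
    (resolved - identified).total_seconds() / 3600
    for identified, resolved in errors
]

remediation_hours = sum(cycle_hours)
mean_cycle_hours = remediation_hours / len(cycle_hours)
remediation_to_feature = remediation_hours / feature_hours

print(f"mean cycle time per error: {mean_cycle_hours:.1f} h")
print(f"remediation-to-feature ratio: {remediation_to_feature:.2f}")
```

If your time tracking records actual hours worked per error, use those instead of elapsed time. Either way, the number that matters is the ratio, measured the same way before and after the rollout.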

The ratio of error remediation work to net new feature work is one of the most telling signals in an agentic SDLC rollout. Some questions worth having answers to:

  • What percentage of total engineering time is going toward net new features?
  • What percentage is going toward error remediation, including testing, fixing, and re-releasing?
  • Is that ratio improving over time, holding steady, or getting worse?
  • Are you classifying errors consistently enough to track trends across time?
  • Is technical debt accumulating faster than it was before the rollout?
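The first three questions can be answered from ordinary time-tracking data. The sketch below assumes each entry is tagged with a month and one of two categories, `"feature"` or `"remediation"`; those labels and the hours are hypothetical.

```python
from collections import defaultdict

# Hypothetical time entries: (month, category, hours).
entries = [
    ("2026-03", "feature", 320), ("2026-03", "remediation", 80),
    ("2026-04", "feature", 300), ("2026-04", "remediation", 120),
    ("2026-05", "feature", 260), ("2026-05", "remediation", 180),
]


def remediation_share_by_month(entries):
    """Share of tracked engineering time spent on remediation, per month."""
    totals = defaultdict(lambda: defaultdict(float))
    for month, category, hours in entries:
        totals[month][category] += hours
    shares = {}
    for month in sorted(totals):
        feature = totals[month]["feature"]
        remediation = totals[month]["remediation"]
        total = feature + remediation
        shares[month] = remediation / total if total else 0.0
    return shares


shares = remediation_share_by_month(entries)
values = list(shares.values())
getting_worse = all(later > earlier for earlier, later in zip(values, values[1:]))

for month, share in shares.items():
    print(f"{month}: {share:.0%} of time on remediation")
print("getting worse every month:", getting_worse)
```

In this made-up data the remediation share climbs every month even though feature work is still shipping, which is exactly the pattern a deployment-frequency chart would hide.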

If the answer to any of those questions is “we’re not tracking that,” then you can’t say with confidence that the rollout is working. You might have a story, but it may not match reality.

The compounding interest of technical debt is the one that gets overlooked most often. Errors that don’t get addressed don’t sit still. They interact with new code, create surface area for new failures, and consume progressively more bandwidth to manage over time. A rollout that looks like a velocity win in month two can turn into a maintenance drag by month six if nobody is paying attention to the overall technical debt accumulating underneath it.

DORA Is a Framework

DORA metrics (deployment frequency, lead time for changes, change failure rate, time to restore) are a genuinely useful starting point. They give you a structured way to think about delivery performance. The problem is that how you define what falls into each category, and what you choose to focus on, determines whether any of it tells you something real.

Deployment frequency going up is a positive signal in isolation. Deployment frequency going up while change failure rate also climbs is a more complicated story. Is a bug reported in the first 24 hours a failure? The first 15 days? Does severity factor in? The definition isn’t in the DORA framework. You have to supply it. If you don’t, you’re measuring the word, not the thing.
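Here is a small sketch of how much the definition moves the number. It assumes each deployment carries a list of bugs recorded as time-after-deploy and a severity from 1 (critical) to 4 (cosmetic). The `Deployment` record, the severity scale, and the data are hypothetical and are not part of DORA.

```python
from dataclasses import dataclass, field
from datetime import timedelta


@dataclass
class Deployment:
    deploy_id: str
    # Each bug is (time after deploy, severity); 1 = critical, 4 = cosmetic.
    bugs: list = field(default_factory=list)


def change_failure_rate(deployments, window, max_severity):
    """Share of deployments with at least one bug inside the window
    at or above the chosen severity (lower number = more severe)."""
    failed = sum(
        1
        for d in deployments
        if any(delay <= window and severity <= max_severity
               for delay, severity in d.bugs)
    )
    return failed / len(deployments) if deployments else 0.0


# Hypothetical deployments.
deployments = [
    Deployment("d1", [(timedelta(hours=3), 1)]),
    Deployment("d2", [(timedelta(days=6), 3)]),
    Deployment("d3", [(timedelta(days=12), 2)]),
    Deployment("d4"),
]

strict = change_failure_rate(deployments, timedelta(hours=24), max_severity=2)
broad = change_failure_rate(deployments, timedelta(days=15), max_severity=4)

print(f"24 hours, severity 1-2 only: {strict:.0%}")
print(f"15 days, any severity: {broad:.0%}")
```

Same deployments, same bugs, and the change failure rate is either 25 percent or 75 percent depending on the window and severity cutoff. Neither is wrong. What matters is that the choice is made once, written down, and applied consistently.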

The same applies across the rest of the framework. If you’re getting faster at deploying code but the code is failing more often and taking longer to recover from, the aggregate picture is not one of improvement. You’ll never see that picture if you’re celebrating deployment frequency in isolation.

The organizations getting real value from these metrics treat them as a system. They’ve defined what each category means in their specific context, they track the relationships between metrics rather than individual numbers, and they ask whether the trends make sense together, not just whether any one number went up.
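One way to force the "do the trends make sense together" question is to refuse to report any single metric without the others. The sketch below is one possible shape for that, not a standard DORA calculation; the metric keys, the verdict labels, and the before/after figures are hypothetical.

```python
# Direction in which each metric improves: +1 means higher is better,
# -1 means lower is better.
BETTER = {
    "deployment_frequency": +1,
    "lead_time_hours": -1,
    "change_failure_rate": -1,
    "time_to_restore_hours": -1,
}


def read_together(before, after):
    """Classify the four metrics as a set instead of one at a time."""
    improved, regressed = [], []
    for name, direction in BETTER.items():
        delta = (after[name] - before[name]) * direction
        if delta > 0:
            improved.append(name)
        elif delta < 0:
            regressed.append(name)
    if improved and regressed:
        verdict = "mixed"
    elif regressed:
        verdict = "regressing"
    elif improved:
        verdict = "improving"
    else:
        verdict = "flat"
    return verdict, improved, regressed


# Hypothetical snapshots from before and after a rollout.
before = {"deployment_frequency": 10, "lead_time_hours": 48,
          "change_failure_rate": 0.10, "time_to_restore_hours": 4}
after = {"deployment_frequency": 18, "lead_time_hours": 30,
         "change_failure_rate": 0.22, "time_to_restore_hours": 7}

verdict, improved, regressed = read_together(before, after)
print(verdict)
print("improved:", improved)
print("regressed:", regressed)
```

A "mixed" verdict doesn’t say the rollout failed. It says the deployment frequency win can’t be reported on its own, because failure rate and recovery time moved the other way in the same period.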

Asking the Right Question

If the conversation started with “can we ship code faster,” it started with the wrong question. If lines of code is the only thing you’re measuring, you’ll get an answer that tells you very little about whether an agentic SDLC rollout actually worked.

The right question is whether you set up the measurements to know. Did you define the work well enough for the funnel to function? Did you establish what the metrics meant in your context before the data started coming in? Did you track where the time was actually going, not just how much code went out the door?

If you did, the data will tell you what’s working and what isn’t. The discipline of setting context before interpreting signal is what separates a rollout that actually improved delivery from one that just generated a more compelling story about it.

Output is easy to celebrate. Outcomes require you to have known what you were looking for.