Why Your DORA Metrics Are Lying to You

Scrums.com Editorial Team
March 11, 2026
5 mins

The DORA report arrives in the quarterly review. Deployment frequency is up. Lead time is down. Mean time to recovery sits comfortably in the High performer tier. And yet the engineering org is slower than it was two years ago, incidents cause real customer impact, and the team is exhausted.

The metrics are not broken. They are likely being measured incorrectly. DORA metrics are harder to implement well than most teams expect, and the common errors do not cancel out; they compound into numbers that look clean but explain nothing.

What DORA Metrics Actually Measure

DORA metrics are four proxy indicators of software delivery performance: deployment frequency, lead time for changes, change failure rate, and mean time to recovery. They emerged from the DevOps Research and Assessment (DORA) program, now part of Google Cloud, and were formalized by Nicole Forsgren, Jez Humble, and Gene Kim in Accelerate (2018). The DORA State of DevOps 2024 report draws on a decade of data across thousands of organizations.

The key word is proxies. These four numbers are a window into delivery performance, not the thing itself. Proxies can be gamed, misapplied, and stripped of context in ways the underlying reality cannot. That is where most implementations go wrong.
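
For orientation, here is a minimal sketch in Python of how the four numbers fall out of raw delivery events. The record layout (commit_at, deployed_at, caused_failure, started_at, restored_at) and the seven-day window are illustrative assumptions, not a standard schema.

# A minimal sketch of the four DORA metrics computed from raw event
# records. Field names and the measurement window are assumptions
# for illustration, not a standard schema.
from datetime import datetime, timedelta
from statistics import median

deploys = [
    {"commit_at": datetime(2026, 3, 2, 9), "deployed_at": datetime(2026, 3, 2, 15), "caused_failure": False},
    {"commit_at": datetime(2026, 3, 3, 10), "deployed_at": datetime(2026, 3, 4, 11), "caused_failure": True},
]
incidents = [
    {"started_at": datetime(2026, 3, 4, 11), "restored_at": datetime(2026, 3, 4, 13)},
]
period_days = 7  # length of the measurement window, in days

deployment_frequency = len(deploys) / period_days  # deploys per day
lead_time = median(d["deployed_at"] - d["commit_at"] for d in deploys)  # commit to production
change_failure_rate = sum(d["caused_failure"] for d in deploys) / len(deploys)
mttr = sum((i["restored_at"] - i["started_at"] for i in incidents), timedelta()) / len(incidents)

print(deployment_frequency, lead_time, change_failure_rate, mttr)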

Five Ways DORA Metrics Mislead Engineering Teams

1. Goodhart's Law Takes Over

When deployment frequency becomes a team target, engineers find ways to hit the number without improving delivery. Small, low-risk changes get split into separate deploys. Feature flags deploy to production but never turn on. Change failure rate falls as teams reclassify incidents below the reporting threshold.

This is Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. All four DORA metrics can be optimized superficially without improving the delivery capability they were designed to reflect. When you set targets against them, that is exactly what happens.

2. No One Is Measuring the Same Thing

Where does lead time for changes start? The DORA definition is from code commit to production. Many teams measure from ticket creation, from branch creation, or from first pull request. Some include deployment approval wait time; others do not.

Change failure rate requires defining what counts as a failure: a P1 incident? Any rollback? Any hotfix pushed within 24 hours of a deploy? When definitions vary across teams (which they do, almost universally), the numbers are not comparable. A team reporting 2-hour lead time may be measuring something fundamentally different from a team reporting 6-hour lead time.
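
A small illustration of the problem, using hypothetical timestamps: the same change yields three different "lead times" depending on where measurement starts, and only the commit-based number matches the DORA definition.

# The same change measured from three different start points.
# Timestamps are made up for illustration.
from datetime import datetime

change = {
    "ticket_created": datetime(2026, 3, 1, 9),
    "branch_created": datetime(2026, 3, 3, 10),
    "first_commit":   datetime(2026, 3, 3, 14),
    "deployed":       datetime(2026, 3, 4, 16),
}

for start in ("ticket_created", "branch_created", "first_commit"):
    print(f"from {start}: {change['deployed'] - change[start]}")

# from ticket_created: 3 days, 7:00:00
# from branch_created: 1 day, 6:00:00
# from first_commit:   1 day, 2:00:00   <- the DORA definition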

3. The Tiers Were Not Built for Your Context

DORA's performance tiers reflect cross-industry benchmarks. The 2024 State of DevOps report classifies Elite performers as deploying on demand or multiple times per day. That threshold makes sense for a SaaS startup deploying consumer features. Applied to a FinTech platform operating under change advisory board requirements and FCA operational resilience rules, where a single deployment involves a 48-hour approval window, it is a category error.

Compliance windows, regulated deployment environments, and enterprise release trains are not engineering failures. They are constraints. Benchmarking a regulated bank's deployment frequency against an Elite threshold tells you about the organizational context, not the engineering team's capability.

4. DORA Captures the System, Not Just the Team

Lead time from commit to production is only partially under engineering's control. But if a feature requires legal review, procurement sign-off, or executive approval before release, that time appears in the metric alongside engineering throughput. The metric captures the whole system.

This is appropriate for organizational analysis. It is misleading for team-level performance conversations. A team with strong CI/CD practices and fast code review can still show poor lead time because of a structural constraint upstream. Treating that as an engineering problem produces the wrong interventions.
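
As a rough sketch of that separation, with hypothetical stage timestamps: once the upstream wait is tracked as its own segment, a week-long lead time resolves into eight hours of engineering work and nearly seven days of queueing.

# Splitting upstream approval wait from engineering throughput so a
# structural constraint does not read as a slow team. Stage names and
# timestamps are illustrative assumptions.
from datetime import datetime

change = {
    "committed":       datetime(2026, 3, 2, 9),
    "review_approved": datetime(2026, 3, 2, 15),  # code review done
    "legal_cleared":   datetime(2026, 3, 9, 12),  # upstream constraint
    "deployed":        datetime(2026, 3, 9, 14),
}

engineering_time = (change["review_approved"] - change["committed"]) + (
    change["deployed"] - change["legal_cleared"]
)
upstream_wait = change["legal_cleared"] - change["review_approved"]
total_lead_time = change["deployed"] - change["committed"]

print(f"total: {total_lead_time}, engineering: {engineering_time}, upstream: {upstream_wait}")
# total: 7 days, 5:00:00 -- dominated by the upstream wait, not the team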

5. The Benchmarks Carry Survivorship Bias

Organizations contributing data to DORA research are not a random sample. They are organizations that opted into a structured DevOps assessment, which selects for teams already paying attention to delivery performance. The correlations between high DORA scores and organizational performance outcomes are real, but they were identified in companies that had the cultural and structural foundations to make those metrics meaningful.

Implementing the measurement framework without those foundations produces numbers that look like DORA data but do not carry the same predictive value. High deployment frequency in an organization without strong review culture or psychological safety is measurement theater.

How to Fix Each Problem

  • Decouple metrics from targets. Track DORA metrics for trend analysis and visibility; do not set frequency or lead time targets. Pair deployment frequency with change failure rate so gaming one without the other becomes visible.
  • Write down your definitions before collecting data. Document where lead time starts and ends, what constitutes a failure, and what counts as a deployment. Review these definitions when someone new joins the team or when results look unexpectedly good. A sketch of one way to record them follows this list.
  • Benchmark against your own history, not the tiers. "Our lead time improved from 8 days to 5 days over six months" is useful. "We are in the Medium performer tier" may not be, if your organizational constraints make Elite thresholds structurally inaccessible.
  • Label what the metric includes. If change approval time inflates lead time, track it separately. Engineering throughput time and total cycle time are both useful; conflating them obscures where the bottleneck actually sits.
  • Use DORA as a compass, not a scorecard. The research tells you what delivery capability looks like in high-performing organizations. It does not tell you that hitting Elite thresholds will make your organization high-performing. The causation runs the other direction.
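
One lightweight way to write definitions down is a single record checked into the repo that the measurement pipeline reads from, so the choices are explicit and reviewable. The field names and thresholds below are hypothetical, not a standard DORA schema.

# A checked-in definitions record the measurement pipeline can read.
# All names and thresholds here are illustrative assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)
class DoraDefinitions:
    lead_time_start: str          # "first_commit" per the DORA definition
    lead_time_end: str            # "running_in_production"
    deployment_event: str         # what counts as one deploy
    failure_threshold: str        # what counts as a change failure
    includes_approval_wait: bool  # label what the metric contains

DEFINITIONS = DoraDefinitions(
    lead_time_start="first_commit",
    lead_time_end="running_in_production",
    deployment_event="release to production serving real traffic",
    failure_threshold="rollback, hotfix within 24h, or P1/P2 incident",
    includes_approval_wait=True,
)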

The Point Is Not to Stop Measuring

DORA metrics, measured consistently over time and against your own baseline, remain among the best available tools for understanding software delivery performance. The 2024 State of DevOps report found that Elite performers are 4x more likely to meet their organizational performance targets. That correlation is real and holds across a decade of data.

The point is to measure carefully. Consistent definitions, team-level context, trend direction over tier benchmarking, and quality signals paired with each metric produce data teams can actually use. Inconsistent definitions, target-driven gaming, and decontextualized comparisons produce numbers that look like DORA data but explain nothing.

If your DORA metrics look clean and delivery still feels broken, the numbers are probably accurate. The problem is in how they are being interpreted.

For a grounding-level overview of what each metric measures and how the four work together, see the DORA metrics guide. For specific benchmarks, see the deployment frequency benchmarks and mean time to recovery benchmarks. For guidance on presenting these metrics to CTOs and CFOs, see the engineering metrics dashboard guide.

Frequently Asked Questions

What are the most common DORA metrics measurement mistakes?

The most common mistakes: setting targets against DORA metrics (which triggers Goodhart's Law), using inconsistent definitions across teams (making numbers incomparable), benchmarking against DORA tiers without accounting for organizational constraints, and conflating system-level lead time with engineering throughput time.

Can DORA metrics be gamed?

Yes. All four can be optimized superficially without improving underlying delivery performance. Deployment frequency rises when teams split changes unnecessarily. Change failure rate falls when incidents are reclassified. Lead time improves when the measurement start point is moved. The fix is to decouple metrics from targets and pair each metric with a complementary signal that is harder to game simultaneously.

How do you implement DORA metrics correctly?

Start with written definitions: where lead time starts and ends, what counts as a deployment, what counts as a failure. Collect data for three to four months before setting any targets to establish a clean baseline. Track trend direction rather than absolute tier position. Review metrics in retrospectives where the team can discuss what drove changes, not in management reports where numbers are compared without context.
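
A minimal sketch of that baseline comparison, using made-up monthly lead-time figures:

# Trend against your own baseline rather than tier position.
# The monthly medians below are made-up illustrative figures.
monthly_lead_time_days = [8.0, 7.5, 7.8, 6.9, 6.1, 5.0]  # six months

baseline = sum(monthly_lead_time_days[:3]) / 3  # first quarter as baseline
current = sum(monthly_lead_time_days[-3:]) / 3  # most recent quarter
change_pct = (current - baseline) / baseline * 100

print(f"baseline {baseline:.1f}d -> current {current:.1f}d ({change_pct:+.0f}%)")
# baseline 7.8d -> current 6.0d (-23%)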

Are DORA metrics relevant for regulated industries like banking or FinTech?

Yes, but they require contextual interpretation. Deployment frequency thresholds in DORA's Elite tier assume environments without regulatory change advisory requirements. FinTech and banking teams should benchmark against their own historical baselines and separate engineering throughput time from compliance approval time, which is a system constraint rather than an engineering failure.

What should you do if DORA metrics look good but delivery performance feels poor?

Audit your definitions first. Verify that lead time, deployment events, and failure classifications are measured consistently and match the DORA definitions. Check whether targets have been set against the metrics, which may have triggered gaming. Then look at what the metrics exclude: DORA measures delivery speed and stability but not feature value delivered, technical debt accumulation, or team sustainability.

Want to track DORA metrics at the team level without turning them into surveillance tools? The Scrums.com engineering intelligence platform gives engineering leaders trend analysis, DORA tier tracking, and delivery visibility without individual performance scoring. Or start with the engineering operations guide for broader context on how delivery metrics fit into the picture.
