
A payment outage triggers two parallel clocks. The first measures how long your engineering team takes to restore service. The second measures your regulatory exposure: when it started, how long it lasted, and whether you need to notify the FCA.
The DORA research program calls recovering from an incident in under one hour "Elite" performance. For payment processors handling card-present transactions, recovering in under one hour is not Elite. It is the expected floor. Teams sitting in DORA's High tier (mean time to recovery under one day) are unlikely to meet their operational resilience impact tolerances for payment services.
This guide covers DORA's four-tier MTTR benchmarks, why payment systems require tighter recovery targets than the general software industry, what drives recovery time in payment infrastructure, and how regulated engineering teams reduce mean time to recovery without cutting corners on reconciliation.
What Is Mean Time to Recovery?
Mean time to recovery (MTTR) is the average time from incident detection to full service restoration. It is one of the four DORA metrics, alongside deployment frequency, lead time for changes, and change failure rate. DORA research uses these four metrics to benchmark software delivery and operational performance across engineering teams.
MTTR is calculated as total incident downtime divided by the number of incidents across a measurement period. A team that experienced three incidents last quarter lasting 45 minutes, 2 hours, and 90 minutes respectively has an MTTR of 85 minutes for that period.
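The arithmetic above can be sketched directly; a minimal example using Python's standard library:

```python
from datetime import timedelta

def mean_time_to_recovery(durations: list[timedelta]) -> timedelta:
    """Average incident duration across a measurement period."""
    if not durations:
        raise ValueError("MTTR is undefined for a period with no incidents")
    return sum(durations, timedelta()) / len(durations)

# The three incidents from the example above: 45 minutes, 2 hours, 90 minutes.
incidents = [timedelta(minutes=45), timedelta(hours=2), timedelta(minutes=90)]
print(mean_time_to_recovery(incidents))  # 1:25:00, i.e. 85 minutes
```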
Two measurement boundaries define MTTR accuracy:
- Start time: when the incident is detected via automated alerting or an initial user report, not when it is acknowledged, assigned, or escalated.
- End time: when the system is fully operational, including transaction reconciliation, not when the first successful transaction processes post-recovery.
The reconciliation boundary matters in payment systems. A payment service can begin accepting card transactions while thousands of prior transactions remain unresolved. Measuring recovery at first-transaction rather than full-reconciliation understates the true incident impact and the actual time the business was exposed.
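The two boundaries can be made explicit in the incident record itself. A minimal sketch, with hypothetical field names: duration runs from detection to full reconciliation, not to the first successful transaction.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    detected_at: datetime            # first automated alert or user report
    first_transaction_at: datetime   # service accepting transactions again
    reconciled_at: datetime          # all prior in-flight transactions resolved

    @property
    def duration(self) -> timedelta:
        # Recovery ends at full reconciliation, not first successful transaction.
        return self.reconciled_at - self.detected_at

incident = Incident(
    detected_at=datetime(2025, 3, 1, 9, 0),
    first_transaction_at=datetime(2025, 3, 1, 9, 40),
    reconciled_at=datetime(2025, 3, 1, 11, 15),
)
print(incident.duration)  # 2:15:00 -- the true exposure
print(incident.first_transaction_at - incident.detected_at)  # 0:40:00 -- the understated figure
```

Measured at first transaction, this incident looks like a 40-minute outage; measured at full reconciliation, it is a 2-hour-15-minute one.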
DORA MTTR Benchmarks
The DORA State of DevOps 2024 report (the last to use the four-tier Elite/High/Medium/Low framework) sets the following MTTR thresholds:
| DORA Tier | MTTR Threshold | What It Looks Like in Practice |
|---|---|---|
| Elite | Less than 1 hour | Automated alerting triggers runbook. On-call engineer restores service within the hour. Common in teams with pre-approved rollback procedures and mature incident response. |
| High | Less than 1 day | Incident resolved within a working shift. Manual investigation required but runbooks and distributed tracing accelerate root cause. Minimum viable for most regulated payment services. |
| Medium | Between 1 day and 1 week | Incidents require cross-team escalation or complex data recovery. Unacceptable for active payment processing flows under FCA operational resilience requirements. |
| Low | More than 1 week | Major system failure requiring architecture-level intervention. For payment firms, any incident in this tier is a reportable regulatory event and likely a customer redress situation. |
Source: DORA State of DevOps 2024.
A note on the framework: the 2025 DORA research program retired the four-tier classification and moved to seven team archetypes based on the intersection of delivery performance and human factors including burnout and organisational friction. The 2024 thresholds above remain the primary industry reference for tier-based MTTR benchmarking because they were the last published in a comparable format.
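The four tiers reduce to a simple classification using the 2024 thresholds from the table above. How to handle an MTTR of exactly one day or one week is a judgment call, as DORA does not specify the boundary behaviour:

```python
from datetime import timedelta

def dora_tier(mttr: timedelta) -> str:
    """Map an MTTR value to its DORA 2024 four-tier classification."""
    if mttr < timedelta(hours=1):
        return "Elite"
    if mttr < timedelta(days=1):
        return "High"
    if mttr <= timedelta(weeks=1):
        return "Medium"
    return "Low"

print(dora_tier(timedelta(minutes=85)))  # High
```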
Why Payment Systems Need Tighter Recovery Targets
DORA benchmarks describe delivery performance across a broad range of software engineering teams. Payment systems operate under a different set of constraints.
Revenue loss runs per minute, not per hour
Gartner estimates the average cost of IT downtime at $5,600 per minute for enterprise organisations. For payment processors, where each unavailable minute represents failed transactions, merchant compensation obligations, and potential card scheme penalties, the exposure is typically higher. A two-hour outage during peak trading hours carries a financial cost that outpaces most quarterly engineering budgets.
Regulatory reporting obligations start early
Under the FCA's Payment Services Regulations and the Bank of England's operational resilience rules, payment firms must notify their regulator of major operational incidents. FCA policy statement PS21/3 requires firms to identify their important business services (which, for a payments firm, includes card processing, account access, and payment initiation) and set impact tolerances for each service. These tolerances define the maximum time a service can be disrupted before the firm is in breach.
Firms had until March 2025 to operate fully within their stated tolerances. For most payment services, those tolerances are measured in hours, not days.
Customer obligations add to the timeline
Beyond regulatory notification, payment firms face chargeback processing for failed transactions, settlement reconciliation with card networks, and customer communication obligations. None of these can run until the system is stable. The longer the incident, the larger the reconciliation backlog after recovery.
What Drives Recovery Time in Payment Systems
Several factors push MTTR higher in payment infrastructure compared with general-purpose web applications.
- Transaction state complexity: Payment systems maintain distributed state across internal databases, card networks, and third-party processors. A recovery that does not account for all in-flight transactions can leave reconciliation gaps that take hours to close.
- Database consistency requirements: Financial data requires ACID compliance (Atomicity, Consistency, Isolation, Durability). Rolling back a deployment in a payment system is not as simple as reverting to a prior container image. The data state must be consistent before the service is considered recovered.
- Card scheme dependencies: Recovery of the internal system does not equal recovery of the payment experience. If an incident involves a card network or banking partner, recovery time is partly outside the engineering team's control.
- Regulatory notification workflow: Incident managers in regulated firms must balance the technical recovery effort against the parallel obligation to notify regulators within mandated timeframes. This coordination overhead extends the incident timeline.
- Conservative traffic restoration: In high-volume payment pipelines, engineers are often cautious about when to declare full recovery. Processing a backlog of queued transactions too quickly can saturate downstream systems. Controlled traffic release adds time to measured MTTR even when the core fault is resolved.
How to Measure MTTR Accurately
Reported MTTR figures are only useful if teams measure them consistently. Common errors inflate or deflate the metric in ways that hide real performance.
Start the clock at detection, not acknowledgement
Incidents are often visible to customers or business teams before they surface in engineering monitoring. Starting the clock when the incident is logged, acknowledged, or assigned rather than at the earliest detection signal (an automated alert or the first user report) understates MTTR. Teams building toward FCA resilience requirements also need to demonstrate automated detection capability, which means alerting should fire before the first customer report, and MTTR should be measured from that detection event.
Define "recovered" consistently
For payment systems, "recovered" means: all transaction paths operational, no pending reconciliation gaps, no degraded fallback mode running. Partial recovery (processing new transactions while a reconciliation job runs overnight) should be logged as "degraded", not "recovered".
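That definition can be encoded as a status check so an incident cannot be closed prematurely. The state fields below are hypothetical; the point is that degraded is a distinct state from recovered:

```python
from dataclasses import dataclass

@dataclass
class ServiceState:
    all_paths_operational: bool        # every transaction path accepting traffic
    pending_reconciliation_items: int  # unresolved prior transactions
    fallback_mode_active: bool         # running in a degraded fallback mode

def incident_status(state: ServiceState) -> str:
    """Classify the service as ongoing, degraded, or recovered."""
    if not state.all_paths_operational:
        return "ongoing"
    if state.pending_reconciliation_items > 0 or state.fallback_mode_active:
        return "degraded"  # accepting new transactions, but not yet recovered
    return "recovered"

# Accepting new card transactions while an overnight reconciliation job runs:
print(incident_status(ServiceState(True, 12_000, False)))  # degraded
```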
Track per service, not per environment
A payment platform typically spans multiple services: authentication, transaction processing, fraud screening, and settlement. MTTR measured across the entire platform masks which service is the reliability bottleneck. Track MTTR per service and per incident category (infrastructure failure, code defect, third-party dependency) to identify where to invest.
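A sketch of per-service, per-category MTTR aggregation over a hypothetical incident log, which surfaces the kind of bottleneck a platform-wide average would hide:

```python
from collections import defaultdict
from datetime import timedelta

# A hypothetical incident log: (service, category, duration) per incident.
incident_log = [
    ("transaction-processing", "code defect", timedelta(minutes=50)),
    ("transaction-processing", "code defect", timedelta(minutes=70)),
    ("transaction-processing", "third-party dependency", timedelta(hours=3)),
    ("fraud-screening", "infrastructure failure", timedelta(minutes=20)),
]

# Group durations by (service, category) so the reliability bottleneck is visible.
buckets: dict[tuple[str, str], list[timedelta]] = defaultdict(list)
for service, category, duration in incident_log:
    buckets[(service, category)].append(duration)

mttr_by_bucket = {key: sum(ds, timedelta()) / len(ds) for key, ds in buckets.items()}
for (service, category), mttr in sorted(mttr_by_bucket.items()):
    print(f"{service} / {category}: {mttr}")
```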
Reducing MTTR: Where Teams Focus
Teams that consistently hit Elite MTTR in payment systems tend to invest in a specific set of capabilities rather than broad engineering process changes.
- Runbook automation: Known incident types (database connection pool exhaustion, certificate expiry, upstream API latency) should have automated or semi-automated runbooks. Manual diagnosis of known issues is the largest single driver of extended MTTR in mature teams.
- Pre-approved rollback procedures: Rollbacks in financial systems require sign-off processes that, if unplanned, can take longer than the technical rollback itself. Pre-approving rollback to N-1 for all production deployments removes a decision bottleneck from the critical path during an active incident.
- Distributed tracing: Payment systems span many services. Without distributed tracing, engineers spend the early minutes of an incident mapping where the fault originated. Tracing tools that correlate transaction IDs across services compress diagnosis time significantly.
- Incident command structure: Clear role assignment (incident commander, communications lead, technical lead) prevents coordination overhead that extends incidents in teams without a defined structure. Teams operating under FCA resilience frameworks often have this as a formal requirement already.
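The runbook-automation point above can be sketched as a dispatcher that maps known incident types to automated actions and falls back to paging for anything unrecognised. The incident types and handlers here are illustrative, not a real API:

```python
# Minimal runbook dispatcher: known incident types get an automated action,
# unknown ones fall through to manual diagnosis by the on-call engineer.
def restart_connection_pool() -> str:
    return "connection pool recycled"

def rotate_certificate() -> str:
    return "certificate renewed and reloaded"

RUNBOOKS = {
    "db-pool-exhaustion": restart_connection_pool,
    "certificate-expiry": rotate_certificate,
}

def handle(incident_type: str) -> str:
    runbook = RUNBOOKS.get(incident_type)
    if runbook is None:
        return "no runbook: page on-call for manual diagnosis"
    return runbook()

print(handle("certificate-expiry"))  # certificate renewed and reloaded
```

Each new incident type diagnosed manually becomes a candidate for a new entry in the runbook table, which is how mature teams shrink the manual-diagnosis share of MTTR over time.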
Engineering intelligence platforms like Scrums.com track MTTR and change failure rate alongside deployment frequency and lead time, giving engineering leaders a full view of DORA performance in one place. See our DORA metrics guide and DORA metrics for FinTech teams for the broader framework context.
Frequently Asked Questions
What is a good MTTR for engineering teams?
According to the DORA State of DevOps 2024 report, Elite teams recover from incidents in under one hour. High-performing teams recover in under one day. For payment systems, recovering in under one hour is the expected standard for core transaction processing. Teams in the Medium or Low tier (MTTR over one day) are likely in breach of their FCA operational resilience impact tolerances for payment services.
How is MTTR different from MTBF?
Mean time to recovery (MTTR) measures how long it takes to restore a system after a failure. Mean time between failures (MTBF) measures how long a system runs between failures. MTBF is a reliability metric; MTTR is a recovery capability metric. For payment systems, both matter, but DORA research focuses on MTTR because it reflects what engineering teams can directly control through deployment practices, runbooks, and incident response design.
Do DORA metrics apply to payment system engineering?
Yes. DORA metrics apply to any software engineering team, including those building payment systems. The four metrics (deployment frequency, lead time for changes, change failure rate, and MTTR) measure delivery performance regardless of industry. Payment teams often find their DORA benchmarks lag behind general-purpose software teams due to the compliance overhead and conservative change management required in regulated environments. That gap is a measurable improvement opportunity.
What does FCA PS21/3 require for MTTR?
FCA PS21/3 does not set a specific numeric MTTR threshold. Instead, it requires firms to identify their important business services (which includes payment processing for payment firms) and set their own impact tolerances for each service. These tolerances define the maximum disruption a service can sustain before causing intolerable harm to customers or market stability. Firms were required to operate fully within their stated tolerances by March 2025.
How should payment teams account for third-party recovery time in MTTR?
Third-party dependencies (card networks, banking APIs, fraud screening services) can extend recovery time beyond what the internal engineering team controls. Track MTTR with and without third-party incidents separately. Log incidents caused by third-party failures with the external dependency noted, so leadership can assess where SLA improvements or redundancy investments are needed. FCA incident reporting will typically require this distinction.
Payment system engineering requires DORA-grade delivery performance and a clear understanding of where your MTTR sits against both industry benchmarks and your own regulatory obligations. The DORA 2024 benchmarks provide the reference point. The FCA resilience framework sets the floor.
For the broader engineering operations context, see our engineering operations guide. Track your full DORA performance with Scrums.com. For FinTech-specific delivery benchmarks, see our DORA metrics for FinTech teams guide. For engineering velocity alongside MTTR, see the velocity measurement guide. For avoiding common DORA measurement pitfalls, see why DORA metrics mislead teams. For presenting DORA data to executives, see the engineering metrics dashboard guide.
Further Reading
- DORA State of DevOps 2024 Report (Google Cloud)
- Accelerate: The Science of Lean Software and DevOps (Forsgren, Humble, Kim)
- FCA Policy Statement PS21/3: Building Operational Resilience (March 2021)
- PagerDuty State of Digital Operations
