
Software systems degrade without active monitoring. Performance slows, errors accumulate, resources become constrained, and availability decreases — often gradually enough that the pattern is only visible in retrospect, after it has already affected users. Monitoring provides the visibility to catch these patterns before they become incidents.
This checklist covers the key metrics that software maintenance programmes should track across uptime, performance, errors, and resource usage. The full Software Monitoring Checklist is available as a downloadable PDF.
Why Monitoring Matters
Monitoring enables proactive maintenance rather than reactive firefighting. Without it, outages are discovered by users rather than by engineering teams. Without performance baselines, degradation is invisible until it becomes an incident. Without error rate tracking, recurring defects compound undetected.
Set monitoring goals based on business needs before configuring tooling. The specific metrics that matter depend on the software architecture, infrastructure, team priorities, and business context. The categories below provide a framework for identifying which metrics apply to your environment.
1. Understanding Your Maintenance Types
Different maintenance activities require different metrics. Aligning monitoring with the type of maintenance being performed helps teams focus on the signals that matter for each objective.
- Corrective maintenance (fixing bugs, defects, crashes): track bug rates, mean time between failures (MTBF), mean time to recover (MTTR), and availability after updates
- Adaptive maintenance (new features, integrations, capabilities): track feature adoption rates, integration success rates, and performance impact of new components
- Perfective maintenance (code quality, workflow optimisation, user experience): track performance benchmarks, technical debt indicators, and user satisfaction trends
- Preventive maintenance (security patching, tech upgrades, redundancy): track vulnerability management status, capacity headroom, and fault tolerance metrics
Tracking metrics tailored to each maintenance category allows teams to take targeted, data-driven actions. The specific metrics will vary by software architecture and business need, but the framework holds across most production environments.
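As a minimal sketch of the corrective-maintenance metrics above, the Python below computes MTBF and MTTR from a hypothetical list of incident records; the record shape, the dates, and the 90-day observation window are illustrative assumptions rather than any specific tool's schema.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (start of outage, time service was restored).
incidents = [
    (datetime(2024, 1, 3, 9, 15), datetime(2024, 1, 3, 9, 47)),
    (datetime(2024, 2, 18, 22, 5), datetime(2024, 2, 19, 0, 40)),
    (datetime(2024, 3, 30, 14, 0), datetime(2024, 3, 30, 14, 20)),
]
observation_period = timedelta(days=90)  # window the incidents were collected over

# MTTR: average time from failure to recovery.
total_repair = sum((end - start for start, end in incidents), timedelta())
mttr = total_repair / len(incidents)

# MTBF: average operating time between failures (uptime divided by failure count).
uptime = observation_period - total_repair
mtbf = uptime / len(incidents)

print(f"MTTR: {mttr}, MTBF: {mtbf}")
```

In practice these figures would be derived from an incident management system rather than a hard-coded list, but the calculation is the same.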
2. Uptime and Availability Monitoring
Uptime measures the percentage of time software remains functional for users. Availability encompasses both uptime and the ability to handle requests successfully while the system is running. For business-critical applications, even brief periods of downtime translate into lost revenue, eroded user trust, and reputational damage.
- Set an explicit availability target for each application based on its business impact: 99.9%, 99.95%, or 99.99% depending on the consequences of downtime
- Configure alerts for unplanned outages immediately — not after the next deployment cycle
- Track both planned and unplanned downtime separately: planned maintenance windows have different operational implications than unexpected failures
- Measure availability from the user's perspective using external monitoring, not just internal health checks that may miss network or CDN failures
Goal-setting for availability percentages should be based on business requirements rather than industry averages. A payment processing system and an internal analytics dashboard warrant different availability targets and justify different levels of monitoring investment.
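To make the availability targets above concrete, the following arithmetic sketch converts each target percentage into the downtime budget it leaves per month and per year.

```python
# Downtime allowed by common availability targets over a 30-day month and a 365-day year.
MINUTES_PER_MONTH = 30 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60

for target in (99.9, 99.95, 99.99):
    allowed = 1 - target / 100
    monthly_minutes = allowed * MINUTES_PER_MONTH
    yearly_hours = allowed * MINUTES_PER_YEAR / 60
    print(f"{target}% -> {monthly_minutes:.1f} min/month, {yearly_hours:.1f} h/year of downtime budget")
```

A 99.9% target leaves roughly 43 minutes of downtime per 30-day month, while 99.99% leaves about four; the gap between adjacent targets is what justifies, or fails to justify, the extra engineering investment.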
3. Performance Monitoring
Application performance monitoring (APM) provides the data to identify bottlenecks before they affect users at scale. Performance problems that are invisible at normal load often become apparent at peak times, which is too late to address reactively.
- Monitor application response times and latency across all key user-facing transactions
- Use synthetic monitoring to simulate user journeys and detect performance regressions between deployments
- Set performance goals for load times and transaction response times based on user experience thresholds, not just server-side metrics
- Track slow query rates in databases and external API call latency, which are frequently the actual bottleneck rather than application code
Performance monitoring enables preemptive maintenance when slowdowns begin to develop. A degradation trend caught when responses are 20% slower than baseline is significantly easier to address than the same problem surfacing under 5x normal load.
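As a rough sketch of a user-perspective synthetic check, the code below times a single HTTP request against a latency budget using only the Python standard library; the URL and the 800 ms threshold are placeholders, and a real synthetic monitor would exercise a full user journey from several locations.

```python
import time
import urllib.request

URL = "https://example.com/health"   # placeholder endpoint, not a real service
LATENCY_BUDGET_MS = 800              # assumed user-experience threshold

start = time.perf_counter()
try:
    with urllib.request.urlopen(URL, timeout=5) as response:
        ok = 200 <= response.status < 300
except OSError:
    ok = False  # network errors, timeouts, and HTTP errors all count as failures
elapsed_ms = (time.perf_counter() - start) * 1000

if not ok:
    print(f"ALERT: synthetic check failed after {elapsed_ms:.0f} ms")
elif elapsed_ms > LATENCY_BUDGET_MS:
    print(f"WARN: {elapsed_ms:.0f} ms exceeds the {LATENCY_BUDGET_MS} ms budget")
else:
    print(f"OK: {elapsed_ms:.0f} ms")
```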
4. Error Rate Tracking
Error rates are one of the clearest signals of system reliability. Sustained or increasing error rates indicate accumulating problems; sudden spikes in error rate are often the first detectable signal of a deployment issue or infrastructure failure.
- Log all system and application errors and categorise them by type and severity
- Set alert thresholds for error rate increases, not just absolute error counts: a 2% error rate on a low-traffic endpoint may be less significant than a 0.1% rate on a critical transaction path
- Analyse error trends over time to identify recurring defect categories worth addressing systematically
- Track error rates before and after deployments as a deployment health signal
Reducing future errors is a maintenance goal in itself. Error rate trends over time indicate whether maintenance efforts are improving system reliability or whether technical debt is accumulating.
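The sketch below illustrates the rate-versus-count point from the list above: it computes an error rate from request and error counts and flags a relative increase over a baseline. The counts and the doubling threshold are made-up values standing in for whatever your metrics store reports.

```python
# Hypothetical counters scraped from a metrics store for one endpoint.
requests_last_hour = 120_000
errors_last_hour = 240
baseline_error_rate = 0.0008   # assumed long-term baseline for this endpoint

error_rate = errors_last_hour / requests_last_hour  # 0.002, i.e. 0.2%

# Alert on the relative increase, not the absolute count:
# a doubling of the baseline rate matters even when traffic is modest.
if error_rate > 2 * baseline_error_rate:
    print(f"ALERT: error rate {error_rate:.2%} is more than 2x the baseline {baseline_error_rate:.2%}")
else:
    print(f"OK: error rate {error_rate:.2%}")
```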
5. Resource Utilisation
Resource constraints are a common cause of performance degradation and unexpected failures. Monitoring utilisation across compute, storage, and network resources provides the data to right-size infrastructure and identify capacity risks before they become availability events.
- Monitor CPU, memory, storage, network, and cloud resource usage across all production components
- Set utilisation thresholds that trigger alerts before resources reach saturation: alerting at 80% CPU gives time to respond before 100% causes degradation
- Review utilisation patterns regularly to identify over-provisioned resources and optimise costs
- Monitor database connection pool usage, queue depths, and thread pool saturation, which are common failure modes that raw CPU and memory metrics do not reveal
Usage patterns uncovered through resource monitoring often prompt proactive software or infrastructure changes, made before users experience the consequences of hitting a resource limit.
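As a small illustration of alerting before saturation, and assuming the third-party psutil package is available, the sketch below samples CPU, memory, and root-filesystem usage and flags anything over an assumed 80% threshold; a production setup would export these readings to a metrics system rather than print them.

```python
import psutil  # third-party package; assumed to be installed

THRESHOLD_PERCENT = 80  # alert well before saturation, per the checklist above

# Sample current utilisation for a few core resources.
usage = {
    "cpu": psutil.cpu_percent(interval=1),        # average CPU over a 1-second sample
    "memory": psutil.virtual_memory().percent,    # physical memory in use
    "disk_root": psutil.disk_usage("/").percent,  # root filesystem usage
}

for resource, percent in usage.items():
    status = "WARN" if percent >= THRESHOLD_PERCENT else "ok"
    print(f"{status}: {resource} at {percent:.0f}%")
```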
6. Custom Metrics for Testing and Support
Standard infrastructure and application metrics provide the baseline. Custom metrics tailored to the specific application and team objectives complete the picture.
- For testing teams: track defect rates by component, test coverage percentages, automation level, and cycle time from defect discovery to resolution
- For support teams: track ticket resolution rates by category, customer satisfaction scores, service level adherence, and escalation rates
- For development teams: track deployment frequency, change failure rate, and mean time to restore as leading indicators of delivery health
- Set alert thresholds on custom metrics the same way as on standard metrics: visibility without alerting produces data that nobody acts on
Tracking both standardised and tailored metrics allows teams to make data-driven maintenance decisions optimised for their specific environment. The standard metrics tell you whether the system is healthy; the custom metrics tell you whether it is healthy in the ways that matter most for your application.
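As one example of a custom delivery-health metric, the sketch below derives deployment frequency and change failure rate from a hypothetical deployment log; the record format and values are assumptions, since in practice these figures would come from CI/CD or incident tooling.

```python
from datetime import date

# Hypothetical deployment log: (date, whether the change caused a failure in production).
deployments = [
    (date(2024, 5, 1), False),
    (date(2024, 5, 3), False),
    (date(2024, 5, 7), True),
    (date(2024, 5, 10), False),
    (date(2024, 5, 14), False),
]

period_days = (deployments[-1][0] - deployments[0][0]).days or 1
deploys_per_week = len(deployments) / period_days * 7
change_failure_rate = sum(1 for _, failed in deployments if failed) / len(deployments)

print(f"Deployment frequency: {deploys_per_week:.1f}/week")
print(f"Change failure rate: {change_failure_rate:.0%}")
```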
7. Ongoing Monitoring as a Maintenance Practice
Monitoring is not a one-time configuration: it requires ongoing investment to remain useful. Systems evolve, traffic patterns change, and the metrics that mattered at launch may not be the ones that matter twelve months later.
- Review alert thresholds regularly and adjust them as the system and its usage patterns mature
- Establish monitoring benchmarks during development so you have baselines to compare against when the system is in production
- Treat monitoring coverage as a quality metric alongside test coverage: new features should include new monitoring instrumentation
- Review your monitoring strategy as part of post-incident retrospectives and update it to detect the conditions that caused each incident earlier
Monitoring provides the data to optimise maintenance decisions over the full software lifecycle. For more on how monitoring tools and practices are evolving, see our overview of trends in software maintenance tools and technologies.
Frequently Asked Questions
What are the most important metrics for software health monitoring?
The highest-priority metrics for most production systems are availability (uptime percentage), error rate, response time, and resource utilisation. These four dimensions cover the failure modes that most directly affect users. Custom metrics specific to your application's business logic and the particular failure modes your team has experienced historically are worth adding to this baseline.
What is the difference between uptime and availability?
Uptime measures the percentage of time a system is running. Availability is a broader measure that encompasses both uptime and the ability to serve requests successfully when running. A system can be technically up but effectively unavailable if it is returning errors, responding too slowly, or unable to handle its load. Availability measured from the user's perspective using external monitoring is more useful than uptime measured by internal health checks.
What is application performance monitoring (APM)?
Application performance monitoring is a category of tooling that tracks the performance of software applications in production, including response times, transaction throughput, slow query detection, and dependency tracing. APM tools provide the instrumentation to diagnose performance problems at the code level rather than just at the infrastructure level. Common APM tools include Dynatrace, New Relic, and Datadog.
How often should monitoring thresholds be reviewed?
Monitoring thresholds should be reviewed at regular intervals (quarterly is common for stable systems), after significant deployments or architectural changes, and after every post-incident retrospective. Thresholds set at launch typically reflect early traffic patterns and system characteristics that change over time. Thresholds that have never triggered may need to be tightened; thresholds that trigger constantly may need to be adjusted to remain actionable.
What custom metrics should software teams track beyond standard infrastructure metrics?
Beyond standard CPU, memory, and uptime metrics, useful custom metrics depend on the application and team. Testing teams typically track defect rates by component, test coverage, and automation percentage. Support teams track resolution rates, SLA adherence, and escalation rates. Development teams track deployment frequency, change failure rate, and mean time to restore. The most valuable custom metrics are those tied directly to outcomes that matter to the business or that have historically preceded problems that were caught too late.