What Metrics Should DevOps Teams Be Tracking?

When speaking with software leaders across the industry, one topic that reliably piques attention is software delivery metrics. DevOps Research and Assessment (DORA), an academic research group now part of Google Cloud, has identified four metrics as the most essential to track, based on six years’ worth of surveys (2014-2019).

To be clear, these metrics are far from perfect. But they allow teams to benchmark their software delivery performance on a scale from Elite to Low performing. The four DORA metrics are:

  1. Deployment frequency
  2. Lead time for changes
  3. Time to restore service
  4. Change failure rate

The term DORA metrics is still relatively new to the industry. As advocates for greater optimization, we see the lack of awareness surrounding DORA as a sign that there’s plenty of work to be done in promoting their use.

The benefits of adopting DevOps best practices are clear. Businesses are far more likely to be profitable and successful when their teams follow practices such as collecting DORA metrics.

Our take is that CTOs and technical leaders are hungry for access to such metrics. So this article summarizes the four key DORA metrics, each in its own section.

Deployment Frequency

This metric is simple. It’s essentially a measure of the frequency of deployments to production for the primary app a software team is working on.

According to the 2018 report, Elite-performing teams deploy multiple times per day. For example, engineers at Amazon were deploying code every 11.7 seconds on average after their move to AWS. Now that’s Elite status among the Elite.

Deployment frequency indicates how often value reaches the end user. Each deployment is an opportunity to collect business value such as revenue, feedback, or the results of an experiment.
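As a rough illustration of the metric, here is a minimal Python sketch that averages production deployments per calendar day. The function name `deployments_per_day` and the sample dates are hypothetical; in practice the dates would come from whatever deployment log your CI/CD tool keeps.

```python
from datetime import date
from typing import Sequence


def deployments_per_day(deploy_dates: Sequence[date]) -> float:
    """Average number of production deployments per calendar day."""
    if not deploy_dates:
        return 0.0
    # The window runs from the first to the last deployment day, inclusive.
    window_days = (max(deploy_dates) - min(deploy_dates)).days + 1
    return len(deploy_dates) / window_days


# Hypothetical sample: six deployments spread over three days.
sample_deploys = [
    date(2020, 6, 1), date(2020, 6, 1), date(2020, 6, 1),
    date(2020, 6, 2),
    date(2020, 6, 3), date(2020, 6, 3),
]

print(deployments_per_day(sample_deploys))  # 2.0
```

An average above one per day is in line with the "multiple deployments per day" that the report associates with Elite teams.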

Lead Time for Changes

This metric tells teams how long it takes to go from code commit to code successfully running in production. Elite-performing teams (those with advanced throughput and stability) really do stand out here, as their lead time for changes is one hour or less.

That’s a significant gap even compared with high- and medium-performing teams, who often need 24 hours or less.
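To make the commit-to-production measurement concrete, here is a small Python sketch. It assumes you can pair each change’s commit timestamp with its production deploy timestamp. The function name `median_lead_time` and the sample data are hypothetical, and using the median is our assumption, not something the DORA reports prescribe.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Sequence, Tuple


def median_lead_time(changes: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Median time from code commit to code running in production.

    Each item is a (committed_at, deployed_at) pair. An empty sequence
    raises statistics.StatisticsError.
    """
    lead_seconds = [
        (deployed_at - committed_at).total_seconds()
        for committed_at, deployed_at in changes
    ]
    return timedelta(seconds=median(lead_seconds))


# Hypothetical sample of three changes.
sample_changes = [
    (datetime(2020, 6, 1, 9, 0), datetime(2020, 6, 1, 9, 30)),    # 30 minutes
    (datetime(2020, 6, 1, 10, 0), datetime(2020, 6, 1, 10, 45)),  # 45 minutes
    (datetime(2020, 6, 1, 11, 0), datetime(2020, 6, 1, 13, 0)),   # 2 hours
]

lead_time = median_lead_time(sample_changes)
print(lead_time)                         # 0:45:00
print(lead_time <= timedelta(hours=1))   # True: within the one-hour Elite mark
```

The median keeps a single slow change (the two-hour one here) from dominating the result, which a plain average would not.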

If you haven’t noticed already, the two metrics discussed so far are closely related, as they’re both measures of throughput. So it’s only natural for DORA to also include measures of stability to round out our picture of software delivery.

Time to Restore Service

The time to restore service (often also referred to as mean time to restore, or MTTR) is the average time it takes to get service back up when incidents occur.

Elite teams tend to be 2,604 times faster to recover than low performers. So even in this measure of stability, speed remains a significant factor.

From a business standpoint, it’s absolutely crucial to have services restored quickly to remain competitive, especially in the world of subscription pricing.
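Since this metric is an average over incidents, it can be sketched in a few lines of Python. The function name `mean_time_to_restore` and the incident timestamps are hypothetical; real data would come from your incident tracker or monitoring system.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def mean_time_to_restore(incidents: Sequence[Tuple[datetime, datetime]]) -> timedelta:
    """Average time between an incident starting and service being restored.

    Each item is a (started_at, restored_at) pair.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    total_downtime = sum(
        (restored_at - started_at for started_at, restored_at in incidents),
        timedelta(),
    )
    return total_downtime / len(incidents)


# Hypothetical sample of three incidents.
sample_incidents = [
    (datetime(2020, 6, 1, 14, 0), datetime(2020, 6, 1, 14, 20)),  # 20 minutes
    (datetime(2020, 6, 8, 3, 0), datetime(2020, 6, 8, 4, 0)),     # 60 minutes
    (datetime(2020, 6, 15, 9, 0), datetime(2020, 6, 15, 9, 40)),  # 40 minutes
]

print(mean_time_to_restore(sample_incidents))  # 0:40:00
```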

Change-Failure Rate

The change-failure rate is the percentage of deployments to production that result in a failure requiring immediate remedy (for example, a service outage or a rollback).

According to the 2018 report, Elite-performing teams reported a mean change-failure rate of 7.5%, compared with an average of 53% for low-performing teams. That’s roughly 7x lower!
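The arithmetic behind these figures is easy to check. The Python sketch below computes a change-failure rate from deployment counts and then compares the 7.5% and 53% means quoted above. The function name `change_failure_rate` and the deployment counts are hypothetical, chosen only to reproduce those two percentages.

```python
def change_failure_rate(total_deployments: int, failed_deployments: int) -> float:
    """Percentage of production deployments that needed immediate remedy."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    return 100.0 * failed_deployments / total_deployments


# Hypothetical counts that reproduce the two means quoted in the text.
elite_rate = change_failure_rate(total_deployments=200, failed_deployments=15)
low_rate = change_failure_rate(total_deployments=200, failed_deployments=106)

print(elite_rate)                       # 7.5
print(low_rate)                         # 53.0
print(round(low_rate / elite_rate, 1))  # 7.1
```

The ratio comes out at about 7.1, which is where the "7x" comparison comes from.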

Data from CircleCI users suggests that teams using continuous integration have a 50% chance of recovering in under an hour, a number that gives such teams Elite status.

The key for truly effective teams, however, is being able to learn from such failures and apply those lessons to reduce the likelihood of similar errors.

Takeaways

We know from the 2019 State of DevOps report that the large gap between Elite and Low performers is partly explained by investment: Elite teams don’t just outperform on delivery metrics, they also out-invest their low-performing counterparts in CI/CD to gain a competitive advantage.

But if one part of toolchain investment raises eyebrows, it’s the investment in integrating Slack with CI/CD.

69% of Elite performers invest time and effort to integrate SlackOps in the deployment process. That’s huge!

Slack is a popular tool among software developers. So the idea of extending SlackOps to deployments represents an interesting trend among industry elites to gain agile and cultural advantages in the software development process.

And as teams become more remote, they need to be more agile.

As CTO.ai CEO and founder Kyle Campbell mentioned in Forbes recently, remote teams using Slack have difficulty sharing technical shortcuts through Slack’s chat interface, which has historically been a complex, insecure, and time-consuming challenge.

Teams therefore have to find better ways to work, and the good news is that Slack’s long list of integrations allows engineers to collaborate across development, testing, and deployment.

So it’s clear (especially during COVID-19) that remote tooling and software delivery metrics will leave companies significantly better off, and that this is the future of DevOps. The current pandemic has accelerated this shift. And while it’s likely that the future of work in all industries lies in a more remote world, it’s even more likely that software engineering will be permanently transformed.