Case Study
Refactoring uptime metrics to accurately reflect the business model
Measuring system uptime in a SaaS environment does not always follow the default industry standards.
Challenge
Uptime metrics were skewed because the client’s operating model was based on critical days of the week and hours of the day rather than the industry-standard 24/7 measure. The client also required separate calculations for mean time to restore and mean time to recover. While these are often treated as the same, the client defines restore as getting the system back online and recover as having all impacted systems fully functional with accurate data, reporting, etc. In addition, the client’s current model assumed more impact than actually occurred, overstating the number of customers impacted by a particular incident. The client was most concerned with incidents at severity level 1 or 2, their most impactful unplanned outages.
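To make the restore/recover distinction concrete, here is a minimal Python sketch of how the two means could be computed side by side for severity 1 and 2 incidents. The Incident record, its field names, and the mean_times function are hypothetical illustrations, not the client's actual data model or tooling.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    # All field names are assumptions made for this illustration.
    severity: int
    started: datetime    # outage begins
    restored: datetime   # system is back online
    recovered: datetime  # all impacted systems functional, data and reporting accurate


def mean_times(incidents, severities=(1, 2)):
    """Return mean minutes to restore and to recover for the chosen severities."""
    selected = [i for i in incidents if i.severity in severities]
    if not selected:
        return None
    restore = mean((i.restored - i.started).total_seconds() / 60 for i in selected)
    recover = mean((i.recovered - i.started).total_seconds() / 60 for i in selected)
    return {
        "mean_minutes_to_restore": restore,
        "mean_minutes_to_recover": recover,
        # Average time spent online but not yet fully recovered.
        "gap_minutes": recover - restore,
    }
```

Reporting the gap alongside both means shows how long, on average, the system was online but not yet fully usable, which a single combined figure would hide.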
Solution
We developed a new uptime calculation that addressed the client’s needs. We reviewed the historical uptime metrics and revised them based on a model that considered the number of outages, impacted customers, outage duration, and concurrent sessions (or logins).
The new uptime calculation algorithm allowed for the flexibility to focus on both the day of the week and the hour of the day while continuing to support the more traditional 24/7 model, as the client wanted to have the option to present this level of detail to their management team.
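The idea of measuring uptime against critical days and hours, with 24/7 as an option, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the algorithm delivered to the client: the CRITICAL_WINDOWS schedule, the uptime_percent function, and the use of an impacted-customer fraction to weight downtime are all hypothetical, and outages are assumed not to overlap one another.

```python
from datetime import datetime, time, timedelta

# Hypothetical critical windows: weekday (Monday = 0) -> (start hour, end hour).
CRITICAL_WINDOWS = {0: (8, 18), 1: (8, 18), 2: (8, 18), 3: (8, 18), 4: (8, 18)}


def overlap_seconds(a_start, a_end, b_start, b_end):
    """Seconds shared by intervals [a_start, a_end) and [b_start, b_end)."""
    latest_start = max(a_start, b_start)
    earliest_end = min(a_end, b_end)
    return max(0.0, (earliest_end - latest_start).total_seconds())


def uptime_percent(period_start, period_end, outages, windows=None):
    """Uptime over the measured windows in [period_start, period_end).

    outages: list of (start, end, impacted_fraction) tuples, where
    impacted_fraction (0.0 to 1.0) is an assumed weighting for the share
    of customers or sessions affected. Outages must not overlap each other.
    windows: None means the traditional 24/7 model.
    """
    measured = 0.0
    down = 0.0
    day = datetime.combine(period_start.date(), time.min)
    while day < period_end:
        if windows is None:
            w_start, w_end = day, day + timedelta(days=1)
        elif day.weekday() in windows:
            start_hour, end_hour = windows[day.weekday()]
            w_start = day + timedelta(hours=start_hour)
            w_end = day + timedelta(hours=end_hour)
        else:
            day += timedelta(days=1)
            continue
        # Clip the window to the reporting period.
        w_start = max(w_start, period_start)
        w_end = min(w_end, period_end)
        if w_end > w_start:
            measured += (w_end - w_start).total_seconds()
            for o_start, o_end, fraction in outages:
                down += fraction * overlap_seconds(o_start, o_end, w_start, w_end)
        day += timedelta(days=1)
    if measured == 0:
        return 100.0
    return 100.0 * (1 - down / measured)
```

For example, a two-hour full outage from 17:00 to 19:00 on a Tuesday counts as one hour of downtime against a 50-hour Monday-to-Friday window, but as two hours against the 168-hour week when windows is None, so the same incident yields different uptime figures under the two models.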
We also enhanced the weekly and monthly management reports and graphs to align with the new calculations, created a basic plug-and-play spreadsheet to maintain historical metrics, and recommended daily management dashboards.
Results
The new uptime metric saved the client tens of hours of refactoring their weekly and monthly uptime metrics, provided more accurate reporting, and enhanced management confidence in system availability during critical business hours.
IT Procedure Management services utilized
Business process optimization, Change management