Practice Makes Perfect, But Does It Really Work In an Actual Crisis?

One of the primary goals of IT Operations and Management is to execute tasks quickly and efficiently so that systems stay up and functional. Monitoring systems inform the network operations center (NOC) about system health and potential issues. Depending on how those monitors are tuned, they can generate warnings so frequently that alert fatigue sets in. The NOC then focuses primarily on high-priority alerts, which can seem to arrive without warning because the earlier, lower-priority warnings were lost in the noise. When these critical alerts occur and systems go down, the impact on service availability, customer experience, and overall company reputation can be significant. To prepare for such events, IT professionals typically conduct tabletop exercises or run-throughs. But do these exercises really work in an actual crisis?

The truth is that the effectiveness of these exercises in a crisis can vary. It largely depends on whether the cause of the unplanned event is known and whether there are documented procedures, or runbooks, that outline the issue, resolution tasks, responsible owners, resolution timelines, and status updates. During many unplanned events, however, standard runbook procedures may not be feasible because the situation is urgent and there is no clear, identifiable cause. Here are a few things to consider when preparing for unplanned events:

Practice event versus real-world event

Most IT professionals are very familiar with running tests in non-production environments. Try as we might to cover every feasible scenario, simulating real-world complexity is difficult because of the number of unexpected variables. On the upside, practicing, testing, and documenting results does provide experience, and it allows for planning and fine-tuning before a real-world event.

Communication and coordination are essential

In my experience, a designated communication and incident coordination lead is critical: this person drives the event and provides reporting to management and customer-facing personnel. It’s advisable to open dedicated communication channels and conduct a roll call of all personnel required to help resolve the issue (or to bring systems back online). Practicing communication and coordination in advance of an actual event increases confidence and improves preparedness.
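For teams that run their incident channel with some light tooling, the roll call can be tracked in a few lines of code. The Python sketch below is only an illustration: the `RollCall` class and the role names are hypothetical and do not come from any particular incident management product.

```python
from dataclasses import dataclass, field


@dataclass
class RollCall:
    """Tracks which required roles have joined the incident channel (hypothetical helper)."""
    required: set[str]
    present: set[str] = field(default_factory=set)

    def check_in(self, role: str) -> None:
        # Only roles on the required list count toward the roll call.
        if role in self.required:
            self.present.add(role)

    def missing(self) -> list[str]:
        # Sorted so every status update lists roles in the same order.
        return sorted(self.required - self.present)

    def status_line(self) -> str:
        waiting = self.missing()
        if not waiting:
            return "Roll call complete: all required roles present."
        return "Still waiting on: " + ", ".join(waiting)


# Example role names are assumptions made for illustration.
roll_call = RollCall(required={"network", "database", "application", "security"})
roll_call.check_in("network")
roll_call.check_in("database")
print(roll_call.status_line())  # Still waiting on: application, security
```

Running the same roll call during a tabletop exercise gives the coordinator a ready-made status line for management updates and shows quickly which roles are slow to respond.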

Continuous Improvements

What we learn from preparing for and responding to unplanned events is a huge takeaway. The more we learn about what happened, why it happened, and how to prevent it in the future, the better we can ensure the event does not recur. This is also an opportunity to improve procedures, train personnel, and recommend system-level improvements.

Scalability and Adaptability

Planning for a variety of scenarios, such as routine maintenance, disaster recovery, emerging threats (e.g., ransomware attacks), and technology updates (e.g., software upgrades), is critical to ensuring ample preparedness. Each of these may look different depending on the IT infrastructure. For on-premises infrastructure, you may focus more on physical security breaches, hardware failures, or network issues. For cloud-based infrastructure, the focus may shift to cloud service outages, data synchronization issues, and cloud security breaches.

Conclusion

Practice may not always be perfect, but tailoring tabletop exercises to the specific needs and complexities of the IT environment can help ensure that IT Ops teams are well-prepared to handle a wide range of scenarios and changes. This adaptability is crucial for maintaining resilience and minimizing the risk of unplanned downtime.


Cadent Solutions has the expertise to assist you with your business process optimization (BPO), software upgrade lifecycle, and organizational change management. Contact us for a complimentary assessment of your business processes and network infrastructure upgrade needs.
