Consequences of IT Disruptions: What’s your Business Continuity & Disaster Recovery Plan?

OpsGuru • Jul 31, 2024

On July 18, financial institutions, airlines, emergency services and medical services worldwide were impacted by what experts deem the most significant IT outage in history.

The OpsGuru team stepped in quickly to help customers impacted by the CrowdStrike Falcon outage recover their environments and get back online. In the process, we gained a deep understanding of both the bug itself, which impacted 8.5 million machines globally, and the industry response. Solving a problem of this magnitude requires hands-on technical skills that are often beyond the reach of the average user, which delays recovery and prolongs the disruption to the business.

CrowdStrike Falcon is a security tool that operates in kernel mode on Windows, and at this level an application error can escalate into a total system failure. However, a mistake of this size can’t be attributed to a programming error alone. The root cause lies in the testing and quality assurance regimes adopted by the players involved in this unprecedented outage. Because the bug did not come from software developed in-house but from a vendor, the issue also ties into the industry’s renewed focus on supply-chain management and security practices, emphasizing the need for robust vendor selection processes and risk management strategies.

How Can You Protect Your Business?

Businesses analyzing outages should be asking themselves one or both of the following questions depending on their business:

How can our business, which provides technical products to clients, minimize the possibility of creating such an outage with our products?
How can we prevent our business IT infrastructure from being affected by an outage like this?

Answering these questions requires a holistic understanding of current business processes and is far from trivial. We’ve seen large companies in heavily regulated sectors, such as finance and aviation, grind to a halt, signalling that even companies known for risk avoidance and a slow pace of change cannot guarantee that they will prevent outages. At OpsGuru, we pride ourselves on our deep expertise in the technical and business domains. We firmly believe that a comprehensive reliability framework should cover both.

Let’s dive into how OpsGuru can help your organization address each of these questions.

How can our business, which provides technical products to clients, minimize the possibility of creating such an outage with our products?

Technical Mitigation Strategies

Adopt a rigorous Pre-Deployment Testing strategy
- Simulated Real-World Environments: Updates should be tested in environments that closely mimic real-world conditions, including various hardware configurations and software setups. This helps identify potential conflicts and stability issues.
- Stress Testing: Performing stress tests on updates can reveal how they behave under extreme conditions, ensuring they can handle high loads without causing system failures.
Use Phased Rollouts
- Define Rollout Policies: Disable automatic updates from external software vendors and define an update policy that works for your enterprise. Deploy updates in stages, starting with a small subset of users or systems, and monitor the update’s performance before rolling it out to the entire user base. This approach limits the impact of potential issues to the systems updated so far.
- Canary Releases: Use canary releases, where the update is initially deployed to a small group of users who serve as a test group. The update can be rolled out more broadly if no issues are detected.
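As a rough illustration of the staged approach above, the sketch below splits a fleet into progressively larger rings and halts as soon as a ring shows unhealthy hosts. Everything in it is an assumption made for the example: the names `Stage`, `STAGES`, `plan_rollout` and `roll_out`, the ring sizes, and the `deploy` and `is_healthy` callbacks, which stand in for whatever deployment and health-check tooling you already use.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    fraction: float  # cumulative share of the fleet updated once this stage completes


# Hypothetical policy: a small canary group first, then progressively wider rings.
STAGES = [Stage("canary", 0.01), Stage("early", 0.10), Stage("broad", 0.50), Stage("all", 1.00)]


def plan_rollout(hosts, stages=STAGES):
    """Split hosts into consecutive rings; each host lands in exactly one ring."""
    rings, start = [], 0
    for stage in stages:
        target = round(len(hosts) * stage.fraction)
        # Always advance by at least one host so small fleets still get a canary.
        end = min(len(hosts), max(start + 1, target))
        rings.append((stage.name, hosts[start:end]))
        start = end
    return rings


def roll_out(hosts, deploy, is_healthy, max_failure_rate=0.0):
    """Deploy ring by ring and stop when a ring exceeds the failure budget."""
    completed = []
    for name, ring in plan_rollout(hosts):
        if not ring:
            continue
        for host in ring:
            deploy(host)
        failures = sum(1 for host in ring if not is_healthy(host))
        if failures / len(ring) > max_failure_rate:
            # Later rings are never touched, which bounds the blast radius.
            return {"halted_at": name, "completed": completed}
        completed.append(name)
    return {"halted_at": None, "completed": completed}
```

In practice the health check would likely run after a soak period rather than immediately after deployment, and a halted rollout would trigger a rollback of the affected ring.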
Establish Enhanced Monitoring and Alerting
- Proactive Monitoring: Implement robust solutions like Prometheus, Nagios, or Datadog to continuously monitor system health and performance. Set up alerts for any anomalies or errors detected.
- Automated Alerts: Configure automated alerts to notify IT teams of potential issues as soon as they are detected. This allows for quick intervention before problems escalate.
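To make the idea of an automated alert concrete, here is a minimal sketch of a sliding-window error-rate check. It does not use the API of any of the monitoring tools named above; the class name `ErrorRateAlert`, its parameters and the default thresholds are hypothetical and would need tuning for a real system.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when the error rate over the most recent checks exceeds a threshold."""

    def __init__(self, window=20, threshold=0.2, min_samples=5):
        self.results = deque(maxlen=window)  # keeps only the last `window` results
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok):
        """Store one health-check result and report whether the alert is firing."""
        self.results.append(bool(ok))
        return self.firing()

    def firing(self):
        # Avoid alerting on a handful of samples right after startup.
        if len(self.results) < self.min_samples:
            return False
        errors = self.results.count(False)
        return errors / len(self.results) > self.threshold
```

Each call to `record` would typically be fed by a periodic probe, and a `True` return value would be forwarded to whatever paging or chat integration the IT team already relies on.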
Have a Strong Incident Response Approach
- Runbooks and Playbooks: Develop detailed runbooks and incident response playbooks to guide IT teams through the steps to take when an issue is detected. These documents should be regularly updated based on past incidents and lessons learned.
Create Comprehensive Redundancy and Failover Plans
- Multiple Instances: Ensure critical systems have redundant instances running in parallel. If one instance fails, others can take over without interruption.
- Automatic Failover: Implement automatic failover mechanisms that detect when a system fails and automatically switch to a backup system. This ensures continuity of service even during outages.
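The failover logic described above can be sketched in a few lines. This is only an illustration under stated assumptions: `FailoverRouter`, its methods and the `failure_threshold` default are hypothetical names and values, and real deployments usually delegate this decision to a load balancer, DNS health checks or a cluster manager.

```python
class FailoverRouter:
    """Route to the first endpoint that has not failed too many checks in a row."""

    def __init__(self, primary, backups, failure_threshold=3):
        self.endpoints = [primary, *backups]  # ordered by preference
        self.failure_threshold = failure_threshold
        self.failures = {endpoint: 0 for endpoint in self.endpoints}

    def report(self, endpoint, ok):
        """Record a health-check result; a success resets the failure counter."""
        self.failures[endpoint] = 0 if ok else self.failures[endpoint] + 1

    def active(self):
        """Return the preferred endpoint that is currently considered healthy."""
        for endpoint in self.endpoints:
            if self.failures[endpoint] < self.failure_threshold:
                return endpoint
        raise RuntimeError("no healthy endpoint available")
```

Requiring several consecutive failures before switching is a deliberate choice in this sketch: it avoids flapping between instances on a single missed health check, at the cost of a slightly slower reaction.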
Effective Communication and Documentation
- Transparent Updates: Communicate clearly with users about upcoming updates, potential impacts, and steps being taken to mitigate risks. Provide detailed documentation on how to recover from any issues that might arise.

Organizational Mitigation Strategies

Create A Robust Vendor Selection Process
- Vetting: Perform thorough due diligence on each vendor you onboard. This includes reviewing the vendor’s software systems and procedures.
- Liability Provisions: Evaluate, negotiate and include explicit liability provisions in vendor contracts whenever possible, especially with vendors of live, business-critical services.
- Minimum Architecture Requirements: Establish a company-wide set of minimum requirements that each software vendor should meet (e.g., a customizable update policy, integration with existing operations and automation solutions, etc.).
Manage Supply Chain Risk
- List of Vendors: Create and maintain a list of all the vendors of critical systems you have deployed in your company’s systems.
Manage Concentration Risk
- Exit Strategy: Create and maintain a clear exit strategy for all your vendors, especially business-critical vendors. In some industries, this is a common regulatory requirement.
- Establish a relationship with a second vendor: After identifying a critical supplier, build a solid working relationship with it and, wherever possible, also identify and onboard alternative suppliers so that switching costs and times are clear.
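A vendor list becomes more useful when it can be queried for gaps. The sketch below assumes a very small, hypothetical data model (`Vendor` and `concentration_gaps` are names invented for this example) and reports every business-critical vendor that lacks an exit strategy or an alternative supplier.

```python
from dataclasses import dataclass, field


@dataclass
class Vendor:
    name: str
    critical: bool
    has_exit_strategy: bool = False
    alternatives: list = field(default_factory=list)  # names of onboarded alternatives


def concentration_gaps(vendors):
    """Map each critical vendor to the risk controls that are still missing."""
    gaps = {}
    for vendor in vendors:
        if not vendor.critical:
            continue
        missing = []
        if not vendor.has_exit_strategy:
            missing.append("exit strategy")
        if not vendor.alternatives:
            missing.append("alternative supplier")
        if missing:
            gaps[vendor.name] = missing
    return gaps
```

A real inventory would likely live in a procurement or asset-management system; the point here is only that the checks described above are simple enough to automate.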

How can we prevent our business IT infrastructure from being affected by an outage like this?

Modern architecture patterns such as serverless computing, WebAssembly (Wasm), containers, and thin clients render full-featured operating systems largely unnecessary. These technologies reduce system complexity and the need for third-party endpoint protection, such as the CrowdStrike Falcon agent. Additionally, recovering these systems from failures is significantly less complex and can often be fully automated. Migrating existing critical systems to a modern architecture, as well as planning new initiatives with modern methodologies, can significantly reduce the exposure your IT infrastructure will face in the future.

Clear Path Forward for Business Continuity & Disaster Recovery

The effort required to implement the strategies outlined above can be large and daunting. OpsGuru can help offload that challenge by planning, developing, and deploying strategies to meet the above goals. We work with your existing staff members to identify the best course of action to meet each business's needs and create solutions designed to be transparent and approachable to your staff, building understanding and trust with the processes and tools required to reduce risk.

Protect your business operation with OpsGuru’s Business Continuity and Disaster Recovery Services (BCDR). Clear Path Forward for BCDR considers the potential impact of a disaster or interruption on the business processes or technology infrastructure and defines the resources and steps to recover. Whether you need a robust Business Continuity Plan or Disaster Recovery Architecture to meet business SLAs, comprehensive protection mechanisms against ransomware attacks, or governance and compliance standards, our team can help.

OpsGuru’s Clear Path Forward for BCDR includes a business and technical deep-dive assessment to identify business continuity requirements and provide architectural best practices for a robust Disaster Recovery solution leveraging Amazon Web Services.

Our team will conduct a thorough review of your existing environment, whether on-premises, public cloud or private cloud, and provide you with a detailed technical recommendation report, including BCDR architecture, implementation specifics and expected total cost of ownership.

Conclusion

The CrowdStrike Falcon outage reminds us of the complexities and risks associated with software updates. By implementing rigorous pre-deployment testing, phased rollouts, enhanced monitoring, robust rollback procedures, comprehensive redundancy and failover plans, and effective communication, organizations can significantly reduce the risk of similar incidents in the future.

At OpsGuru, we have worked with several customers, such as Numeris, Edsembli and Trimac, to help implement the best practices defined above. To protect your business and ensure continuity, contact our team of experts today.
