Thursday, September 12, 2024

The CrowdStrike incident exposed the urgent need for modern DevOps practices


On July 19th, 8.5 million devices running Windows crashed when cybersecurity giant CrowdStrike released a faulty software update. The ensuing outage wreaked havoc across nearly every major business sector: flights were grounded, medical procedures were delayed, and news stations couldn’t broadcast.

For the companies affected, the cost implications could reach tens of billions of dollars. However, this incident is part of a much larger, growing problem. Poor software quality cost the US economy at least $2.41 trillion in 2022. With customers and employees increasingly reliant on digital services, organizations urgently need to reassess how they deliver software to protect themselves from future failures.

A growing challenge

For those aiming to ensure their company doesn’t become the next headline about a major IT outage, the challenge lies in how easily errors slip through the cracks. Although engineers rigorously test their code, the sheer volume of releases means it’s inevitable that bugs will sometimes escape, even from the most experienced developers. Instead of expecting developers to deliver perfect code every time, organizations need to provide them with the means to manage risks effectively.

If CrowdStrike’s development teams had been able to roll out updates incrementally and instantly revert to a previous version at the first sign of trouble, the worldwide outage in July could have been prevented. With EU regulations like DORA and NIS2 on the horizon, organizations need to prioritize empowering their developers with these capabilities to maintain service reliability and avoid triggering a major outage. Failure to do so could result in non-compliance fines of up to 2 percent of annual global revenue, alongside damage to their reputation and immediate revenue loss.

Adopting modern DevOps practices

The most effective way to control the risks of software-driven innovation is to systematically push out updates to small batches of users, monitor results, and adjust the code based on feedback or behavior. This can best be achieved with modern DevOps platforms that support the end-to-end process and provide the tools developers need to deliver fast and reliable software updates. Alongside automation, two crucial capabilities for enhancing reliability and minimizing the risk of bugs are canary deployments and feature flagging.
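
As a rough illustration of that push, monitor, and adjust loop, here is a minimal Python sketch. The function name `staged_rollout`, the stage fractions, the 1 percent error threshold, and the `error_rate_at` monitoring hook are all assumptions made for this example, not features of any particular DevOps platform.

```python
from typing import Callable, Sequence, Tuple


def staged_rollout(
    stages: Sequence[float],
    error_rate_at: Callable[[float], float],
    max_error_rate: float = 0.01,
) -> Tuple[str, float]:
    """Widen exposure stage by stage and revert at the first unhealthy reading.

    stages: fractions of users to expose, in increasing order.
    error_rate_at: hypothetical monitoring hook that returns the observed
        error rate once the update has reached the given fraction of users.
    Returns the outcome and the fraction exposed when the decision was made.
    """
    if not stages:
        raise ValueError("at least one rollout stage is required")
    for fraction in stages:
        observed = error_rate_at(fraction)
        if observed > max_error_rate:
            # Stop widening the rollout; everyone goes back to the old version.
            return ("rolled_back", fraction)
    return ("completed", stages[-1])


# A healthy update reaches everyone; a faulty one stops at the first stage.
print(staged_rollout([0.01, 0.10, 0.50, 1.0], lambda fraction: 0.001))
print(staged_rollout([0.01, 0.10, 0.50, 1.0], lambda fraction: 0.30))
```

In this sketch, a faulty update is caught while only the first 1 percent of users are exposed, which is the whole point of releasing in small batches.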

Canary deployments involve making staged releases to a small group of users before a full rollout. This exposes the update to real-world use and feedback on a small scale, reducing disruption and mitigating the risk of major outages if a bug is present. Organizations can confirm the new version is reliable before delivering it to all users. Feature flags provide further control by enabling developers to turn functionality on and off in live services without deploying new code. This allows engineers to immediately disable problematic functionality, effectively rolling back the change without a redeploy, which limits the impact on users and gives teams more freedom to experiment.
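
To make the two ideas concrete, the sketch below combines them in a few lines of Python: a flag that exposes a feature to a stable percentage of users (the canary cohort) and can be switched off for everyone at once. The `FeatureFlag` class, its fields, and the flag name are hypothetical. Real feature-flag services typically keep this state outside the application so it can change without a deploy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class FeatureFlag:
    name: str
    enabled: bool = True      # kill switch: False turns the feature off for everyone
    rollout_percent: int = 0  # share of users (0-100) in the canary cohort

    def is_on_for(self, user_id: str) -> bool:
        if not self.enabled:
            return False
        # Hash the flag name and user id so each user lands in a stable
        # bucket from 0 to 99 and sees consistent behavior across requests.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 100
        return bucket < self.rollout_percent


flag = FeatureFlag("new_checkout", rollout_percent=100)
print(flag.is_on_for("user-1"))  # True: the rollout covers every user

flag.enabled = False             # problem detected: flip the kill switch
print(flag.is_on_for("user-1"))  # False: disabled without deploying new code
```

Because the bucket depends only on the flag name and user id, raising `rollout_percent` from 10 to 50 keeps the original canary users in the cohort and simply adds more.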

A pipeline problem

It’s human nature to want to move quickly, especially when delivering minor updates. Developers are no different. In the rush to ship code, engineers have been known to skip stages in the development process. That is why organizations must embed capabilities like feature flagging and canary deployments into the delivery pipeline itself. Quality control cannot be left ad hoc, especially in larger organizations with hundreds or thousands of engineers.

To address this, it’s important to create an automated pipeline that ensures all checks and balances are completed without compromising speed. One effective approach is to adhere to platform engineering practices. This involves a central team establishing an internal developer portal (IDP) where teams can self-serve the capabilities they need to automate low-value development tasks, such as testing, and use standardized delivery pipelines with built-in canary deployments and feature flagging.
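
One way to picture a standardized pipeline is as a fixed list of stages that no release can skip. The Python sketch below is a simplified illustration under that assumption. The stage names in `REQUIRED_STAGES` and the `run_pipeline` function are hypothetical, and a real pipeline would be defined in the configuration of whichever delivery platform the team uses.

```python
from typing import Callable, Dict

# Hypothetical stage names; a platform team would define its own.
REQUIRED_STAGES = ("unit_tests", "security_scan", "canary_release")


def run_pipeline(stages: Dict[str, Callable[[], bool]]) -> str:
    """Run every required stage in order; refuse to ship if one is missing or fails."""
    missing = [name for name in REQUIRED_STAGES if name not in stages]
    if missing:
        # A team cannot skip a stage simply by leaving it out.
        raise ValueError(f"pipeline is missing required stages: {missing}")
    for name in REQUIRED_STAGES:
        if not stages[name]():
            return f"blocked at {name}"
    return "released"


# Each stage is a callable that returns True on success.
checks = {
    "unit_tests": lambda: True,
    "security_scan": lambda: True,
    "canary_release": lambda: False,  # canary cohort reported a problem
}
print(run_pipeline(checks))  # blocked at canary_release
```

The design choice worth noting is that the required stages live in one shared place rather than in each team's scripts, so skipping a step becomes an error instead of a shortcut.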

This type of automated quality control makes it much easier to catch defects before they reach every user. It’s like having a system that can recall faulty software the moment a problem appears, without waiting for someone to gather data and intervene manually. This will be a crucial tool for developers as they aim to accelerate innovation without distractions.

A wake-up call

The CrowdStrike outage served as a stark reminder of the impact software failures can have on businesses across all sectors worldwide. As we move toward an increasingly digital and highly regulated world, the importance of software reliability cannot be overstated.

This latest incident was an unmistakable wake-up call for organizations to empower their development teams to use modern DevOps practices effectively and build a robust, reliable software delivery process. Ultimately, these capabilities will enable development teams to focus on accelerating innovation without the looming threat of catastrophic failures.

Image Credit: Alexandersikov / Dreamstime.com

Martin Reynolds is Field CTO at Harness.
