😦 The Real CrowdStrike Flaw Was In Deployment
By now, most of you are familiar with the story of CrowdStrike and the resulting worldwide outages. The initial, and angry, responses were along the lines of "Where are the devs? Where was QA?"
As it turned out, the actual patch, or software code, was fine. The problem was in the config file that governs the behavior of the patch. The flawed file triggered the Windows blue screen of death. In a variation of "Who Watches The Watchers," the config file was indeed tested, but CrowdStrike's testing system was itself broken, and nobody knew it. The config file passed when it shouldn't have.
Nothing is perfect all the time, but a gradual or staggered rollout of the config file would have mitigated the fallout, in place of the apparent "big-bang" release to all machines at once. The staggered rollout is an old and reliable technique. I can think of two reasons why it might have been forgotten or abandoned: inexperience or overconfidence.
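For readers who haven't seen one, here is a minimal sketch of a staggered rollout in Python. It is an illustration only, not CrowdStrike's or anyone else's actual process. Every name in it (`bucket`, `staged_rollout`, the `deploy` and `is_healthy` callbacks, the stage percentages, and the failure threshold) is hypothetical. The idea is to push the change to a small, stable slice of machines, check their health, and widen only if that slice looks fine.

```python
import hashlib


def bucket(host_id: str, buckets: int = 100) -> int:
    """Map a host to a stable bucket in [0, buckets) by hashing its ID."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def staged_rollout(hosts, stages, deploy, is_healthy, max_failure_rate=0.01):
    """Deploy in widening waves and stop as soon as a wave looks unhealthy.

    stages: increasing percentages, e.g. (1, 10, 50, 100).
    deploy, is_healthy: caller-supplied callbacks (hypothetical here).
    Returns (hosts_deployed_so_far, completed).
    """
    deployed = []
    done = set()
    for pct in stages:
        # Hosts whose bucket falls under the current percentage and
        # that have not already received the change.
        wave = [h for h in hosts if h not in done and bucket(h) < pct]
        for h in wave:
            deploy(h)
            done.add(h)
            deployed.append(h)
        # A real rollout would let each wave soak (hours or days)
        # before checking health; this sketch checks immediately.
        failures = sum(1 for h in wave if not is_healthy(h))
        if wave and failures / len(wave) > max_failure_rate:
            return deployed, False  # halt: the blast radius stays small
    return deployed, True
```

With stages of `(1, 10, 50, 100)`, a config file that crashes every machine it touches is stopped after the first wave that receives it. That wave is roughly 1% of the fleet, far from all of it. The price is the one described next: more time and more monitoring for every release.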
When I worked at the NYSE, config file releases were, at minimum, on a 3-day schedule, no matter how trivial the change. It meant more vigilance and operations monitoring over those 3 days, but the culture aimed for 99.9% uptime. Sure, we had our periods of inexperience (technology changed rapidly), and we had characters with grand hubris (what company doesn't?). What likely saved us was an attitude summarized by a project manager I knew there. He once told me about the time a developer fixed a bug and declared, "It was easy. Just one line of code."
The project manager knew it would be neither.