CrowdStrike published its “preliminary post incident review” this morning. And it hides some horrific details inside the boring verbiage.
As you’ll recall, millions of PCs and servers bluescreened last week. The cause was a corrupt CrowdStrike security update that caused the machines to access illegal memory from within the Windows kernel itself.
Today, we learned two incredible things: that this type of rapid update isn’t tested by people, and that CrowdStrike neither dogfoods these updates nor does staged, “canary” deployments. In today’s SB Blogwatch, we sit slack-jawed in horror.
Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention: davepl’s take.
What’s the craic? Camilla Hodgson reports: CrowdStrike to implement new checks to avoid another global IT outage
“90 minutes”
Cyber security group will improve testing and stagger updates … to avoid a repeat of the global IT outage that hit millions of computers last week, as the cyber security company outlined the initial findings of its investigation. [It] planned to implement “a staggered deployment strategy” for updates similar to the one that triggered last week’s outage.
…
On Wednesday, in a preliminary review of the incident, CrowdStrike said the “undetected error” in the software had been missed due to a “bug” in its “content validator,” which is supposed to check for problems. … It took about 90 minutes for millions of machines to be affected by the faulty update, which began to be rolled out on Friday, before CrowdStrike discovered the problem and took action.
But how? Tom Warren adds: CrowdStrike is making improvements
“Staggered deployment”
On Friday, CrowdStrike issued a content configuration update for its software. … These updates are delivered regularly, but this particular configuration update caused Windows to crash. … A tiny 40KB Rapid Response Content file caused Friday’s issue.
…
While CrowdStrike performs both automated and manual testing on Sensor Content and Template Types, it doesn’t appear to do as much thorough testing on the Rapid Response Content. [This] led to the sensor loading the problematic Rapid Response Content into its Content Interpreter and triggering an out-of-bounds memory exception.
…
CrowdStrike will … implement a staggered deployment of Rapid Response Content, ensuring that updates are gradually deployed to larger portions of its install base instead of an immediate push to all systems.
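For anyone wondering what an “out-of-bounds” read looks like in practice, here is a minimal Python sketch. Everything in it is hypothetical: the `Rule` shape, the `validate` and `interpret` functions, and the sample data are illustrations, not CrowdStrike’s file format or code. The idea is simply that content which points at a field that doesn’t exist should be rejected before it ships, and that the interpreter should re-check anyway rather than trust the validator. Python would raise an `IndexError` on a bad index; kernel-mode native code has no such safety net, which is how a bad read can become a bluescreen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical content entry: match a pattern against one input field."""
    field_index: int
    pattern: str


def validate(rules: list[Rule], field_count: int) -> list[str]:
    """Return a list of problems; an empty list means the content may ship."""
    problems = []
    for position, rule in enumerate(rules):
        if not 0 <= rule.field_index < field_count:
            problems.append(
                f"rule {position}: field_index {rule.field_index} "
                f"outside 0..{field_count - 1}"
            )
    return problems


def interpret(rules: list[Rule], fields: list[str]) -> list[bool]:
    """Evaluate each rule, re-checking bounds instead of trusting the validator."""
    results = []
    for rule in rules:
        if not 0 <= rule.field_index < len(fields):
            results.append(False)  # skip the bad rule rather than crash
            continue
        results.append(rule.pattern in fields[rule.field_index])
    return results


# Made-up sample data: three input fields, one rule set that fits and one that doesn't.
fields = ["cmd.exe", "1234", "named-pipe"]
good = [Rule(0, "cmd"), Rule(2, "pipe")]
bad = [Rule(0, "cmd"), Rule(5, "pipe")]

good_problems = validate(good, len(fields))
bad_problems = validate(bad, len(fields))
bad_results = interpret(bad, fields)
```

Checking twice is deliberate: the validator catches bad content before release, and the interpreter’s own check covers the case where the validator itself has a bug.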
Horse’s mouth? CrowdStrike’s PR team have been burning the candle at both ends: Preliminary Post Incident Review
“Starting with a canary deployment”
On Friday, July 19, 2024 at 04:09 UTC, … CrowdStrike released a … problematic Rapid Response Content configuration update [that] resulted in a Windows system crash. … Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data [and] were deployed into production.
…
How Do We Prevent This From Happening Again? …
Improve Rapid Response Content testing [including] local developer testing. …
Implement a staggered deployment strategy, … in which updates are gradually deployed to larger portions of the sensor base, starting with a canary deployment.
Improve monitoring, … collecting feedback during … deployment, to guide a phased rollout.
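The staggered strategy and monitoring items above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CrowdStrike’s design: the wave percentages, the abort threshold, and the `deploy` and `crash_rate` callbacks are all made up for the example. Hosts are ordered by a stable hash so the same small canary group always goes first, and each wave must report a healthy crash rate before the next one starts.

```python
import hashlib

# Hypothetical cumulative wave sizes, as a percentage of the fleet.
WAVES_PERCENT = [1, 10, 50, 100]
MAX_CRASH_RATE = 0.001  # assumed abort threshold, not a CrowdStrike figure


def bucket(host_id: str) -> int:
    """Map a host ID to a stable integer so cohorts don't reshuffle per update."""
    digest = hashlib.sha256(host_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rollout(hosts, deploy, crash_rate):
    """Deploy wave by wave; stop as soon as a wave looks unhealthy.

    deploy(host) pushes the update; crash_rate(wave) reports the fraction of
    hosts in that wave that crashed. Both are caller-supplied stand-ins.
    Returns (hosts_updated, completed).
    """
    ordered = sorted(hosts, key=bucket)
    updated = []
    for percent in WAVES_PERCENT:
        target = max(1, len(ordered) * percent // 100)
        wave = ordered[len(updated):target]
        for host in wave:
            deploy(host)
        updated.extend(wave)
        if wave and crash_rate(wave) > MAX_CRASH_RATE:
            return updated, False  # halt: later waves never get the update
    return updated, True


# Simulate a bad update: every host that receives it crashes.
fleet = [f"host-{n}" for n in range(1000)]
deployed = []
updated, completed = rollout(fleet, deployed.append, lambda wave: 1.0)
```

In this simulated bad release, only the 1% canary wave receives the update before the rollout halts. The other 99% of the fleet never sees it.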
So they don’t do canary deployments of these updates before pushing to customers? Kevin Beaumont notes “the key takeaway”:
Channel updates are currently deployed globally, instantly. They plan to change this at a later date to operate in waves.
…
Obviously they shouldn’t have been globally, simultaneously deploying kernel driver parameter changes across all customers: It was waiting to go wrong.
And they don’t even dogfood it? Crypto Monad enjoys some tasty canine fodder:
Maybe CrowdStrike should release their code to their own desktops and servers, an hour or so before releasing it to the rest of the world.
Unbelievable. anonzzzies is looking for trouble:
CS should pay damages and go bankrupt, so others might think twice in the future. I wouldn’t want to live in a world where people can get away with mistakes like this.
…
If one person in CS had actually tried this file on their local windows machine, it would’ve crashed. That’s not a small thing is it? That’s a Boeing like culture which should be terminated today.
Well, the CEO is being dragged in front of a House committee. supremebob sounds slightly sarcastic:
Ooh, another two hours of political grandstanding! I’m sure that this will be super helpful to resolve the problem, just like the congressional hearings against Facebook were.
…
Crowdstrike will probably need to make a few well timed “campaign donations” to help smooth things over, though.
How much blame is shared with CrowdStrike’s customers? Not a lot, thinks TeMPOraL:
I don’t think “CS are idiots” … was ever on anyone’s threat board. … What happened was pretty much an act-of-God-level crisis, except caused by a de-facto infrastructure provider with way more destructive power than they’re equipped to handle.
…
CrowdStrike should be liable for the damage. [And] this incident should be a prompt to rethink the entire idea of endpoint security.
Meanwhile, Badabing offers this colorful metaphor:
Farmer Brown is to implement a “check stable door is closed” policy with immediate effect. “We hope this will have a significant positive impact on horse retention,” said Farmer Brown.
You have been reading SB Blogwatch by Richi Jennings. Richi curates the best bloggy bits, finest forums, and weirdest websites—so you don’t have to. Hate mail may be directed to @RiCHi, @richij or @richi.bsky.social. Ask your doctor before reading. Your mileage may vary. Past performance is no guarantee of future results. Do not stare into laser with remaining eye. E&OE. 30.
Image sauce: César Ardila (via Unsplash; leveled and cropped)