site banner

Culture War Roundup for the week of July 15, 2024

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.

9
Jump in the discussion.

No email address required.

So I keep trying to find any information on the technical aspects of this failure. As in, why is it bricking systems. I get that it's a driver that runs under the operating system, and it's failing to load. But why? I've only seen random reports that Crowdstrike literally pushed a corrupted file onto millions of systems, which is rather remarkable if true. If it was actually a bug, I'm deeply curious to hear what the bug was and how it slipped through.

To get really wild and speculative, lately it's been getting reported that Intel 13th and 14th gen I9 CPUs might be defective at incredibly high rates, upwards of 50%. These defects manifest in whole hosts of ways like BSOD, software crashes, and memory errors. I wonder if it's possible a defective Intel CPU borked the executable of an otherwise rigorously tested release. Like I said though, pure speculation. The nature of the Intel failures are still being investigated anyways.

Ok so there's an update on what happened.

The exact crash is caused by dereferencing a null pointer the offending assembly is readable by anyone, and it is as follows mov r9d.dword ptr [r8], the key is that the value of r8 is 0000 0000 0000 009c 9c is an offset of some sort set earlier, so it's derefrencing a null pointer. The pointer is NULL because the value in the file C-00000291.sys was published to be all 0s causing r9d to get loaded as all 0s

So the offending assembly probably looks like

read r8 C-00000291.sys (some offset)

add r8 9c

mov r9d.dword ptr [r8]

causing the bug.

From this, it kind of sounds like rather than having an on-disk data representation that would be parsed and converted to an in-memory data structure, they just loaded the file and accessed the raw bytes as a data structure with internal pointers. Which is... an approach, I guess.

It's an executable; that's how executables work.

Not really. The linker and a bunch of other transformations are going to happen before any of your instructions run. Dumping and loading bytes of a structure straight out of memory has long been considered a lazy and dangerous thing to do; no one is surprised that this sort of bug arose from it.

Eh, not really? Executable files have structure in them other than raw code and still have to be parsed by a loader. A file that's all zeros should fail to load. (Yes, I know DOS had .com files with were just code blobs loaded at a fixed address and immediately executed and I'm sure there are even more ancient examples of that sort of thing, but surely Windows kernel modules can't work like that.)

Anyway, the rumors I've read said that it was actually a data file and that's why they considered it acceptable to deploy it on a Friday -- the assumption being that changing configuration without rolling out a new version of the executable wouldn't break things too badly.

That might be how executables in an operating system work. Wouldn't be how extremely low level BIOS or ROM code that is meant to be executed before the OS loads would work. I can't say for certain exactly how that works these days, but when I was troubleshooting some BIOS code on an old computer of mine, I found myself decompiling a VGA BIOS. And that basically works by being in a certain memory block, it begins with a consistent signature to signal "Yup, there is code here" to the motherboard BIOS, and then it begins loading and executing instructions at a certain offset to initialize the card. Fun fact, you can actually reinitialize the VGA BIOS with a short assembly program that just CALL's to that location if memory serves.

What you are describing sounds more like a boot sector, i.e., raw machine code meant to be read from bootable media and executed directly by firmware (the mobo BIOS in your example)

I’d be surprised if in any modern operating system, executables (even those loaded and run at boot time) were handled that way. Then again, one is reminded of the old chestnut about idiot-proofing software…

The problem with turing machines is that pretty much everything becomes equivalent at high enough levels of generality. Windows EXEs (and DLLs) have a specific format that make it impossible to load an empty or (most) malformed files, but if the surrounding format is correct enough you can absolutely have it followed by a bunch of nonsensical instructions and memory locations -- there is a checksum, but (infamously), it isn't actually mandatory to load or run.

Worse, there's no rule that your executable is the only place that such instructions can come from, and few architectures try. Even in Harvard architectures like Atmels or PICs, there are specific instructions to transfer from the data bus into the program and vice versa. Modern operating systems on von Neumann architectures try to stop you from doing so by accident, by setting memory pages as either instruction or data, and in modern Windows machines further isolating data instructions with DEP, but it's ultimately just a set of flags.

There are arguments against doing this, in favor of having a having your base program load from more conventional configuration files with a strict format (eg JSON), or even having a very limited programming language that your core driver then 'runs'. They have some tradeoffs! But ultimately the problem is a lot more boring: in each case, you have to be able to recognize and respond to a corrupt file. And that's a solved problem! But you have to recognize it.

Modern operating systems on von Neumann architectures try to stop you from doing so by accident, by setting memory pages as either instruction or data, and in modern Windows machines further isolating data instructions with DEP, but it's ultimately just a set of flags.

Hmm, so a driver running with kernel privileges in Windows can just ignore memory segmentation and treat arbitrary memory as instructions to execute? Then I stand corrected.

More comments

I'm pretty sure I could write a C program right now that would run in Windows 10 that will load and run arbitrary assembly instructions from a binary file. The C program might have all the trappings of a proper Win10 executable, but the file it loads and runs sight unseen wouldn't. I'm pretty sure that's what the Crowdstrike driver is doing with the file full of 00's.

Apparently, the corrupted file was just filled with nulls:

https://twitter.com/jeremyphoward/status/1814364640127922499

I'm trying to image what might cause that; truncating the file and then failing to write it? My filesystem-fu isn't really up to par.

Saving a file using a filesystem that journals metadata followed by computer crash that happens before the file contents are flushed is one way to achieve it.

Wasn't there an old joke about an MBA cutting costs in half by getting rid of the 1s and standardizing on 0s?

This is not confirmed information, but I am hearing it on various technical grapevines and it seems plausible:

The primary bug is not new - the kernel-level driver that Crowdstrike runs (and has been running) has a dormant bug in the portion of it that parses config/data files. This update was "just" a config/data file, so deemed low-risk and put through fewer/simpler rounds of testing than a "real" update to their actual software. Whether it was a weird corner case or a malformed file, the kernel driver tripped over it and triggered the dormant bug. Since it's a kernel-level driver, crashing can affect the OS - and it did, generating an exception on a bad memory access (perfectly routine type of bug, but with privileges!) so the OS crashed.

Im not in a position to confirm but that seem quite plausible and dove-tails with some of what I've heard.

For some reason i find myself thinking of this old XKCD 😉

Lol that is amazing. Sounds like the most plausible explanation, but maybe even worse because it seems like that should have been caught in a dev or staging environment

Forget about dev or staging, there's no excuse for not fuzz testing your config parser in current year plus nine.