Culture War Roundup for the week of August 5, 2024

This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.

Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.

We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:

  • Shaming.

  • Attempting to 'build consensus' or enforce ideological conformity.

  • Making sweeping generalizations to vilify a group you dislike.

  • Recruiting for a cause.

  • Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.

In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:

  • Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.

  • Be as precise and charitable as you can. Don't paraphrase unflatteringly.

  • Don't imply that someone said something they did not say, even if you think it follows from what they said.

  • Write like everyone is reading and you want them to be included in the discussion.

On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.


The CrowdStrike incident report is up

As far as documents go, it shows that CrowdStrike's competence is... horrific.

Finding 1.

This means that when the sensor wanted to make a detection decision based on the IPC Template Type, the sensor code would supply 20 different input sources to the Content Interpreter. However, the definition of the IPC Template Type in the Template Type Definitions file stated that it expected 21 input fields. This definition resulted in Template Instances in Channel File 291 that expected to operate on 21 inputs. This mismatch was not detected during development of the IPC Template Type. The test cases and Rapid Response Content used to test the IPC Template Type did not trigger a fault during feature development or during testing of the sensor 7.11 release

What this says is that they did not test supplying the IPC Template Type to the sensor at all, or check how many parameters the IPC Template Type produces? What kind of nonsense is this.
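
If both counts had been visible to the compiler, a one-line static assertion would have caught this at build time. A minimal sketch in C, with hypothetical names (not CrowdStrike's actual code):

    /* Hypothetical counts: what the sensor supplies vs. what the
       Template Type Definitions file declares for the IPC Template Type. */
    #define SENSOR_IPC_INPUT_COUNT   20
    #define IPC_TEMPLATE_FIELD_COUNT 21

    /* This build would have failed on the spot, since 20 != 21. */
    _Static_assert(SENSOR_IPC_INPUT_COUNT == IPC_TEMPLATE_FIELD_COUNT,
                   "sensor inputs must match template field count");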

2. A runtime array bounds check was missing for Content Interpreter input fields on Channel File 291. Findings: The Rapid Response Content for Channel File 291 instructed the Content Interpreter to read the 21st entry of the input pointer array. However, the IPC Template Type only generates 20 inputs. As a result, once Rapid Response Content was delivered that used a non-wildcard matching criterion for the 21st input, the Content Interpreter performed an out-of-bounds read of the input array. This is not an arbitrary memory write issue and has been independently reviewed.

(hey can you prevent autoformatting for quotes it's really annoying that I can't exactly quote the doc)

So they didn't do the one-liner check of the array's inputs? I know in C you can't do this because arrays do not contain their own length as a variable, but a C++ vector would have found this error (I guess in the kernel it's C or bust?). Congrats on using the root of all evil, the regex. So the regex created some interesting behavior on the (invalid) 21st input because of an OUT OF BOUNDS ARRAY access, oh boy.
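
For the record, the runtime check the report says was missing really is about one line. A sketch with made-up names:

    #include <stddef.h>

    /* Hypothetical accessor for the Content Interpreter's input array:
       refuse to read past `count` instead of trusting the template. */
    static const char *get_input(const char **inputs, size_t count, size_t index)
    {
        if (index >= count)
            return NULL;  /* caller treats NULL as "no match" rather than crashing */
        return inputs[index];
    }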

3. Template Type testing should cover a wider variety of matching criteria. Findings: Both manual and automated testing were performed during the development of the IPC Template Type. This testing was focused on functional validation of the Template Type including the correct flow of security-relevant data through it, and evaluation of that data to generate appropriate detection alerts based on criteria created in development test cases. Automated testing leveraged internal and external tooling to create the required security-relevant data needed to exercise the IPC Template Type under all supported Windows versions within a broad subset of the expected operational use cases. For automated testing, a static set of 12 test cases was selected to be representative of broader operational expectations and to validate the creation of telemetry and detection alerts. Part of this testing included defining a channel file for use within the test cases. The selection of data in the channel file was done manually and included a regex wildcard matching criterion in the 21st field for all Template Instances, meaning that execution of these tests during development and release builds did not expose the latent out-of-bounds read in the Content Interpreter when provided with 20 rather than 21 inputs.

Automated testing somehow doesn't include having 21 valid inputs for your 21-parameter function? Man, now that's some brainpower. ChatGPT can write tests better than that.

12 test cases, which didn't seem to include any invalid inputs? Where's your input validation? Where's the array bounds checking?
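
A sketch of the kind of table-driven test that was missing here; one deliberately short input set is all it would have taken (the validator and all names are hypothetical):

    #include <stdio.h>
    #include <stddef.h>

    /* Hypothetical validator: a template must not expect more inputs
       than the sensor actually supplies. */
    static int counts_compatible(size_t supplied, size_t expected)
    {
        return supplied >= expected;
    }

    int main(void)
    {
        struct { size_t supplied, expected; int want; } cases[] = {
            { 21, 21, 1 },  /* happy path */
            { 20, 21, 0 },  /* the Channel File 291 scenario: one input short */
            {  0, 21, 0 },  /* degenerate input */
        };
        for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
            int got = counts_compatible(cases[i].supplied, cases[i].expected);
            printf("case %zu: %s\n", i, got == cases[i].want ? "pass" : "FAIL");
        }
        return 0;
    }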

4. The Content Validator contained a logic error. Findings: The Content Validator evaluated the new Template Instances. However, it based its assessment on the expectation that the IPC Template Type would be provided with 21 inputs. This resulted in the problematic Template Instance being sent to the Content Interpreter.

As expected: NO INPUT VALIDATION.

CLOWNSTRIKE indeed.

5. Template Instance validation should expand to include testing within the Content Interpreter. Findings: Newly released Template Types are stress tested across many aspects, such as resource utilization, system performance impact and detection volume. For many Template Types, including the IPC Template Type, a specific Template Instance is used to stress test the Template Type by matching against any possible value of the associated data fields to identify adverse system interactions. A stress test of the IPC Template Type with a test Template Instance was executed in our test environment, which consists of a variety of operating systems and workloads. The IPC Template Type passed the stress test and was validated for use, and a Template Instance was released to production as part of a Rapid Response Content update. However, the Content Validator-tested Template Instance did not observe that the mismatched number of inputs would cause a system crash when provided to the Content Interpreter by the IPC Template Type.

Basically they didn't do integration testing.

Something like

    IPCTemplateType a = IPCTemplateType.new(1,2,3,4,5,6,7)
    ContentInterpreter b = functionThatBreaks(a)

literally would have crashed instantly.

They tested each piece independently, making a fake template type for the Content Interpreter instead of using a real generated one.

OK, I know integration testing is hard and gets exponentially complicated quickly, but you can do basic tests by generating a single instance and then checking.

Or here's a billion-dollar idea: just turn on a goddamn Windows machine locally with your patch before sending it out. This patch broke ~100% of the Windows machines it came across, so you just needed to have done 1 manual patch of 1 fucking machine locally to have discovered this bug.

6. Template Instances should have staged deployment. Findings: Each Template Instance should be deployed in a staged rollout.

Basic procedure for every large org, and it wasn't followed at something this big? CLOWNSTRIKE continues.

I understand that when you have 100 customers a delayed rollout does basically nothing, but at around 1,000 customers it does matter, and at the scale CrowdStrike was operating at, delayed rollouts are basically mandatory.
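
The gating logic itself is trivial; the hard part is the organizational discipline to use it. A sketch of the usual approach, hashing a stable machine ID into a rollout cohort (all names hypothetical):

    #include <stdint.h>

    /* Map a stable machine ID to a cohort in [0, 100) via FNV-1a. */
    static unsigned cohort(const char *machine_id)
    {
        uint64_t h = 0xcbf29ce484222325ULL;      /* FNV-1a offset basis */
        for (const char *p = machine_id; *p; p++) {
            h ^= (unsigned char)*p;
            h *= 0x100000001b3ULL;               /* FNV-1a prime */
        }
        return (unsigned)(h % 100);
    }

    /* Ship the update only to machines below the current rollout stage,
       e.g. 1% -> 10% -> 100% as confidence grows. */
    int should_receive_update(const char *machine_id, unsigned rollout_percent)
    {
        return cohort(machine_id) < rollout_percent;
    }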

OK, the rest of the doc is mostly corporate jargon and meaningless, but boy, this wasn't your normal fuckup; this was a fuckup of epically stupid programming oversight. Multiple errors that an absolute novice should have caught, which the most basic of tests would have found.

What the fuck is wrong with Clownstrike?

Petty programming nitpicks that don't matter, but still:

in C you can't do this because arrays do not contain their own length

Arrays do (at compile time, so if you have the array type, sizeof will return the actual size); it's just that they decay to pointers if you do anything with them, like pass them to a function.
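
Concretely, a minimal sketch:

    #include <stdio.h>

    static void takes_array(int arr[20])  /* really `int *arr`: decayed */
    {
        printf("%zu\n", sizeof arr);      /* size of a pointer, e.g. 8 */
    }

    int main(void)
    {
        int a[20];
        printf("%zu\n", sizeof a);  /* 80: the whole array, 20 * sizeof(int) */
        takes_array(a);             /* `a` decays to &a[0] right here */
        return 0;
    }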

literally would have crashed instantly.

Accessing data outside of an array is undefined behaviour and often won't crash if it's just one access past the end; it'll just fetch garbage instead. You'd have to build the program with an "undefined behavior sanitizer" that detects stuff like that, but I don't know if that's compatible with running in the Windows kernel.
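
To make that concrete: this snippet usually just prints whatever garbage sits past the array, but built with something like `clang -fsanitize=bounds,undefined` it aborts with a clean diagnostic instead:

    #include <stdio.h>

    int main(void)
    {
        int inputs[20] = {0};
        /* Undefined behaviour: index 20 is one past the end. Unsanitized
           builds typically read adjacent stack memory rather than crash;
           the bounds/UB sanitizers turn this into a graceful abort. */
        printf("%d\n", inputs[20]);
        return 0;
    }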

Undefined behavior isn't something you solve at runtime; by definition, the standard makes no guarantees about what it does. Working in kernelspace is definitely not gonna give you the performance leeway to catch UB whoopsies. What should be catching it is a static analysis tool like Fortify, especially given Falcon is deployed on government hardware. If Crowdstrike doesn't have Fortify or a similar tool as part of their ops process, they've got security compliance issues.

Depends on what you mean by "solve", but the implication was that e.g. UBSan does detect many instances of UB at runtime. Not in the sense of solving the problem in production by catching and handling it, but of solving the problem of easily detecting it during debugging and testing, since you get graceful crashes instead of the program potentially loading garbage and continuing.

Arrays do (at compile time, so if you have the array type, sizeof will return the actual size); it's just that they decay to pointers if you do anything with them, like pass them to a function.

I'm but a humble C++ programmer who hasn't used raw arrays outside a class for so long that I forgot they only decay to a pointer in certain cases.

Accessing data outside of an array is undefined behaviour and often won't crash if it's just one access past the end; it'll just fetch garbage instead. You'd have to build the program with an "undefined behavior sanitizer" that detects stuff like that, but I don't know if that's compatible with running in the Windows kernel.

I figured the UB would have resulted in a null pointer exception every time, though. Yes, a UB sanitizer is probably unworkable in a kernel program; I don't write kernel code.

Yes, a UB sanitizer is probably unworkable in a kernel program

A kernel program is not that different from any other.

I think the problem is that the kernel would need to support the memory sanitizer. As long as your kernel module only touched memory allocated from its own functions, you could in theory run a memory sanitizer without kernel support, but such a module would be pretty useless. The problem is that if the kernel gives you a buffer, how does a memory sanitizer with no knowledge of the kernel know that the buffer is safe to read or write? Apparently Windows does have support for KASAN (https://www.microsoft.com/en-us/security/blog/2023/01/26/introducing-kernel-sanitizers-on-microsoft-platforms/), so vendor support should make it workable, though I don't use Windows so I don't know how well it works. Also, I guess you could just have a userspace test harness, but for something like this you probably need some kind of final test with the module running in the kernel.

In this case it is, with completely different rules about stdlib usage, memory allocation, what can and cannot be paged etc.

I'm but a humble C++ programmer who hasn't used raw arrays outside a class for so long that I forgot they only decay to a pointer in certain cases.

To be fair, this is a wart in C's design. Nobody serious (in particular, nobody who does kernel-level programming) uses array arguments, because decay is inconsistent. The last time I saw the topic discussed, it was in the context of Linus bollocking someone over it.

Having the out-of-bounds entry be zeroes in testing and garbage in real life is also a way it can pass the test but fail when deployed. Imagine it's a struct, and the function accessing it checks whether one value is true and just stops processing if the value is false. It wouldn't crash in testing, but once deployed, depending on the check, it could have a very high probability of crashing. A boolean check usually decays into some kind of comparison against 0, so if the value is stored in 8 or even 32 bits, it's very likely to be non-zero.
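
A sketch of that failure mode (the buffer is deliberately oversized so the demo itself isn't UB; the point is the zeroed-versus-garbage difference):

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical matcher that reads one field past the count it was
       given: fields[20] is "out of bounds" relative to count == 20. */
    static int matches(const int *fields, size_t count)
    {
        (void)count;              /* the missing bounds check */
        return fields[20] != 0;   /* boolean check == compare against 0 */
    }

    int main(void)
    {
        int buf[32];
        memset(buf, 0, sizeof buf);        /* test harness: zeroed memory */
        printf("%d\n", matches(buf, 20));  /* prints 0: test quietly passes */
        return 0;
    }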

Arrays do (at compile time, so if you have the array type, sizeof will return the actual size)

They do, if their size is statically defined.

Arrays, including dynamically sized VLAs, always know their size in C, and it can be queried with sizeof. You just need access to the actual array type. What you can't do is look up the array type, and thus the size, after it has been type-erased via pointer decay.

That would really depend on how you declare the array. `int array[n]` will know its size; `int *array = malloc(4*n)` will not.

That's because you are not working with an array in the second case, merely a pointer to an int. You can often use such a pointer in an array-like fashion if it points to an element of an array, but since you are just referencing the element and not the whole array, you don't have access to type information about the array. You could get an array that knows its size with something like `int (*array)[n] = malloc(sizeof(int[n]));`, but that is not common usage.
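
Putting the declarations side by side, a quick sketch:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = 21;

        int fixed[21];                              /* array type */
        int vla[n];                                 /* VLA: sizeof works at runtime */
        int *ptr = malloc(n * sizeof(int));         /* merely a pointer to int */
        int (*whole)[21] = malloc(sizeof(int[21])); /* pointer to a whole array */

        printf("%zu\n", sizeof fixed);   /* 84 with 4-byte ints */
        printf("%zu\n", sizeof vla);     /* 84, evaluated at runtime */
        printf("%zu\n", sizeof ptr);     /* pointer size, e.g. 8 */
        printf("%zu\n", sizeof *whole);  /* 84: sizeof the pointed-to array */

        free(ptr);
        free(whole);
        return 0;
    }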

"Our test suite kept failing, so we disabled it"

The real failure is a failure of management. How was this update allowed to override clients' specific update staging and rollout policies? Why did management allow the devs NOT to dogfood their product? It's been my experience with software dev that if the devs aren't dogfooding, the product probably sucks ass.

What about the nulled-out config people were posting on xitter?

The new IPC Template Type defined 21 input parameter fields, but the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values to match against.

I'm not sure I get it. The content interpreter takes a template instance and N inputs? Are they passing inputs as an array?

once Rapid Response Content was delivered that used a non-wildcard matching criterion for the 21st input, the Content Interpreter performed an out-of-bounds read of the input array.

Yeah, I guess they are. I'm going to tap the sign again and say that you really ought to be fuzzing your config parser in current year plus nine. A basic fuzz test that generates some arbitrary "rapid response content" and some arbitrary inputs would have ferreted this out in no time.
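
For the unfamiliar, the harness really is tiny; `parse_channel_file` here is a hypothetical stand-in for whatever entry point actually parses Rapid Response Content:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical target: parse a channel file from an arbitrary buffer. */
    int parse_channel_file(const uint8_t *data, size_t size);

    /* libFuzzer feeds this mutated byte strings; build with
       `clang -fsanitize=fuzzer,address`. An out-of-bounds read like
       Channel File 291's becomes a reported crash, typically in minutes. */
    int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
    {
        parse_channel_file(data, size);
        return 0;
    }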

I want naming and shaming. If you fail this badly, there need to be very serious consequences for you personally. Fines, public humiliation, prison sentence, corporal punishment... Otherwise why would anyone bother doing things correctly in future?

What if a selection of Crowdstrike executives, coders and management had to have 'moron' tattooed on their foreheads?

Ask Congress to draft a letter of marque against CrowdStrike.

Put it in the queue. I'm still waiting for the Wuhan lab staff to get drawn and quartered. Or at the least get 10 years of hard labor.

As it happens, I'm fairly convinced now that covid was in fact zoonotic in origin and the Wuhan lab didn't leak. But still, it's insane to rely on labs not leaking contagious pathogens.

Could you share the readings that led you to believe that? I tend to hold the Jon Stewart "chocolatey outbreak in Hershey PA" view of things.

Basically the Rootclaim/Peter Miller debate (as summarised by Scott) convinced me. In particular, it appears that the pattern of very early cases strongly suggests that the very first emergence of the virus was at the wet market.

Containing viruses is hard. I'd consider a lab leak to be within the normal range of human screw ups. In light of that fact, everybody up and down the chain of command who thought it was a good idea to do gain-of-function research on coronaviruses should be tried on 10 million counts of manslaughter.

I know I'm not getting everyone up to Gorbachev; they're the real culprits, but it's not reasonable to expect. But give me a Dyatlov at least.

I'll settle for putting Fauci in a box for however long he has left to live. Let us maintain the illusion that there are any consequences whatsoever to mass death.

What did Fauci do that was "mass death"? Did he sign off on US funding for the Chinese GoF research?

Yes. Also conspired with other "researchers" to cover it up, also lied under oath to congress about it.

Why the scare quotes around "researchers"? Are you talking about political hangers-on who were trying to squash the story because Beijing money or in case it gave racists ammo or whatever, or are you talking about GoF researchers themselves? (I wouldn't scare-quote in the latter case; GoF is mad science, but science nonetheless.)

Fauci approved bat coronavirus research conducted by EcoHealth Alliance at the Wuhan Institute of Virology. So yes, that's specifically what he did. Fauci has denied that what they were doing matches the academic definition of gain of function, but he quibbled really hard trying to justify that.

I haven't researched this deeply, but I believe this is the case having heard some of Fauci's testimony regarding EcoHealth Alliance and his actions pre-pandemic.

That's not all: he's also behind the Lancet letter trying to actively prevent people from discussing the biolab origin theory. There are emails of him specifically making it happen.

That's enough for me, whatever he decides to call it he was ashamed enough of it to try and hide it by suppressing genuine scientific inquiry into something that was actively killing people.

What if a selection of Crowdstrike executives, coders and management had to have 'moron' tattooed on their foreheads?

That would be neither useful, nor justified. First, plenty of those people you named are almost certainly guilty of nothing except choosing to do as they were told instead of losing their livelihood. That doesn't deserve punishment. Second, for those who are truly guilty of negligence, losing their job is enough negative consequence as long as it isn't some bullshit "he got fired but he got paid handsomely for it" as is often the case for executives. We shouldn't reward incompetence (cue @faceh justifiably tapping the sign), but neither do we need to take extreme measures to punish it. Just regular punishment is enough, if we actually do it. Third, your solution would ultimately just cause people to work hard at covering up their sins and make things worse overall. You would have to be pretty stupid to do an honest post mortem if it meant someone was going to get a caning or a "moron" tattoo as a result.

This is how we get hundred-billion-dollar black holes, massive financial crises, wars that go nowhere based on pure fantasy and defiance of reality, 20 years of barking up the wrong tree on Alzheimer's research due to fraud...

I believe in regular punishment for regular incompetence but this was above and beyond anything normal. Just doing as you're told isn't good enough for critical infrastructure like this. Any normal person tests updates before releasing them. And in the case of egregious failures where the whole organization has gone badly off the rails, why should anyone trust that they'd do a proper post-mortem? Any investigation should be run by outsiders, that's the most basic step.

Furthermore, punishment enhances public trust that everyone is in it together. The leadership class enjoys prestige and great wealth, they should also accept great penalties if they make massive negligent errors.

This is how we get hundred-billion-dollar black holes, massive financial crises, wars that go nowhere based on pure fantasy and defiance of reality, 20 years of barking up the wrong tree on Alzheimer's research due to fraud...

Where did you read that take? It's not clear to me whether you mean Marc Tessier-Lavigne or Sylvain Lesne, but in both cases it's a huge overstatement. Both were peripheral to the main story of beta-amyloid plaques, which originated before either author and had strong evidence in its favor (though I would argue there is strong evidence at this point that the plaque hypothesis is false). But the enormous research and clinical efforts on beta-amyloid plaques would have happened even in the absence of either fraud.

I'll note that of the people in the field I've spoken with, most still believe in the plaque hypothesis and think we just aren't treating patients early enough or some other excuse.

This is how we get hundred-billion-dollar black holes, massive financial crises, wars that go nowhere based on pure fantasy and defiance of reality, 20 years of barking up the wrong tree on Alzheimer's research due to fraud...

No, we get that by rewarding incompetence (there's that sign tap again...). We don't need to overcorrect to fix that, we just need to actually punish those people instead of promoting them or whatever.

I believe in regular punishment for regular incompetence but this was above and beyond anything normal. Just doing as you're told isn't good enough for critical infrastructure like this.

This isn't critical infrastructure, come on. It's freaking antivirus. It's not the only one, nor is it ubiquitous. It's just another software product.

Any normal person tests updates before releasing them.

I'm willing to bet you that the technical people did want to test updates. Maybe their direct managers did too, although that I'm less certain about. But at the end of the day, when your boss says "do this or else", very few people are willing to take the "or else" option. That's not unreasonable of them.

And in the case of egregious failures where the whole organization has gone badly off the rails, why should anyone trust that they'd do a proper post-mortem?

Because they did.

Furthermore, punishment enhances public trust that everyone is in it together.

We're gonna have to agree to disagree on this one. I don't think it helps.

No, we get that by rewarding incompetence (there's that sign tap again...). We don't need to overcorrect to fix that, we just need to actually punish those people instead of promoting them or whatever.

@RandomRanger's point is that if you are rewarded for recklessness (or punished for prudence) a lot of the time and only punished for recklessness when something goes wrong, the punishment when something goes wrong needs to be large to outweigh the benefit and thus provide a net disincentive.

This isn't critical infrastructure, come on. It's freaking antivirus. It's not the only one, nor is it ubiquitous. It's just another software product.

I hear that it's basically required in a bunch of fields for regulatory compliance purposes; is that not so? Also, uh, I can't get any hard numbers but I'm guessing a bunch of people died due to hospitals getting hit. When you're playing the government-contracts game, there are responsibilities attached to that.

I'm willing to bet you that the technical people did want to test updates. Maybe their direct managers did too, although that I'm less certain about. But at the end of the day, when your boss says "do this or else", very few people are willing to take the "or else" option. That's not unreasonable of them.

Usually when "do this" has massive negative externalities, you want a) to have the boss get in trouble for saying that, b) to have the civil/criminal penalty for "do this" be larger than the corporate penalty for "or else". Basic game theory; you want "do this" to never be picked, so you need to make sure those picking never have an incentive to pick it regardless of what other shenanigans are going on.

if you are rewarded for recklessness (or punished for prudence) a lot of the time and only punished for recklessness when something goes wrong, the punishment when something goes wrong needs to be large to outweigh the benefit and thus provide a net disincentive.

Losing your job is already a pretty big disincentive (assuming no golden parachute shenanigans). I don't think it needs to be bigger than that necessarily. On top of that, there's every reason to believe that the company is going to struggle financially as customers bail - this is further disincentive at the company level, and will affect the decision making at the individual level.

I hear that it's basically required in a bunch of fields for regulatory compliance purposes; is that not so? Also, uh, I can't get any hard numbers but I'm guessing a bunch of people died due to hospitals getting hit. When you're playing the government-contracts game, there are responsibilities attached to that.

Crowdstrike is not required; security measures are required. It's up to the regulated organization to choose how to implement that requirement. I don't think they become critical infrastructure just because critical infrastructure orgs choose to make use of them.

Also, uh, I can't get any hard numbers but I'm guessing a bunch of people died due to hospitals getting hit.

Government happens on such a large scale that anything nontrivial which is less than perfect will lead to people dying. Never mind hospitals, you can calculate statistical loss of lives just based on the economic impact. And then you can follow up by calculating statistical loss of lives based on something that has only economic impact. If we don't allow for mistakes resulting in deaths, we can't have government at all.

Our personal intuitions don't scale up here. You on your own can only cause deaths by being malicious or by being so careless that you could reasonably be expected to not be so. But if something is large scale enough, even ordinary human imperfection is enough to cause deaths. A standard which says that you should never cause deaths there is unworkable; causing deaths is inevitable.

Usually when "do this" has massive negative externalities, you want a) to have the boss get in trouble for saying that, b) to have the civil/criminal penalty for "do this" be larger than the corporate penalty for "or else".

This doesn't work when the chance of the negative externality happening is small. Past a certain point, increasing the punishment won't cause any more deterrence.

Man, I was going to post it under the OP but didn't want to be too irritating.

But yeah. Some accountability to filter out the incompetence WOULD BE A NICE CHANGE OF PACE.

For what it's worth, I kind of love it when you tap the sign. It's a well written post and I enjoy reading it every so often.

Well, that warms my soul a bit. I sometimes think about doing an expanded writeup on the topic, but there's no need, since Nassim Nicholas Taleb literally wrote the book on it.

Since I do promote this idea as pretty much the "grand unified theory of institutional decay and dysfunction" I'm glad it hits the mark. I really don't want officials and CEOs committing Seppuku over their failures (well, maybe in rare cases), but I also can't help but think sometimes that my dream job would be The Assassin from Serenity.

Or here's a billion-dollar idea: just turn on a goddamn Windows machine locally with your patch before sending it out. This patch broke ~100% of the Windows machines it came across, so you just needed to have done 1 manual patch of 1 fucking machine locally to have discovered this bug.

That brought it home for me. Our IT department (a total of three people, one of whom never touches these projects) created a bug in their software and only caught it on the "trial" rollout. That caution might have saved nearly a dozen man-hours of workers waiting for them to revert the changes.

If we can get that right in a small company that barely touches software, how could a multibillion-dollar corporation that focused on security fail?

This is exactly my thought. I've been pushing for two additional tiers of production rollouts for our software: Dark Releases and Smoke Releases. They've paid dividends for (relatively speaking) trivial solutions multiple times over. It's just such an obvious step to take.

Having enough discipline on your unit testing team to make things meaningful (i.e. not just regex wildcard matching) seems so, so obvious too. But if you'd used the former approach, you'd have seen that you needed to fix that root cause many moons ago.

Yeah, there are situations where a bug gets past your development or test environment, and there are times when you're legitimately in too much of a hurry to test, but this is...

Like, the best spin I can give on this is that whatever safeties they did have were apparently enough to keep this from already happening earlier, when it didn't cause hundreds of millions of dollars in damage and probably a couple of lives on the margins? But as much as the ratsphere jokes about the ideal rate of errors never being zero, the ideal rate of this error is so much lower than this that it's hard to compare.

I can forgive the testing errors. It’s hard to test every possible thing and I’m sure the developers were under a lot of time pressure to ship “features features features” because tech management just keeps asking for more every year.

But the lack of a canary rollout is unforgivable and would seriously make me question the software running on my machine (if I'm a customer). You can't even roll out locally to some virtual machines you have in a staging environment to ensure they run before YOLO pushing to every customer? Even if it's not "code" and it's just a config you're pushing, you absolutely must must must stage and canary your rollouts. Not doing it is catastrophic incompetence, and I actually thank Crowdstrike for proving the point, because now I have cover for a few years to tell managers to suck my dick if they ask for a YOLO rollout.

Some things in it are dumb, but I'm a bit sympathetic: you literally can't test everything; organizational mandates to test everything ("we're TDD!") just slow down development without adding much if any value; and if your organization relies on not having any shitty programmers on the team, your organization is doomed to fail, because at scale you will get shitty programmers.

The thing that's absolutely unconscionable is the lack of staged rollouts. That limits the blast radius of bad releases no matter how stupid the person was who made the bad release. That's like SRE 101.

I would say I'm appalled and surprised at how bad that is, but I'm not. It's just the state of software engineering as a field. We're all idiots and should die in a fire. (I'm a bit grumpy because I spent the day integrating with an LLM-based auto code eval system that represents scores with emojis.)

Man, can't wait for machines to tell me my code is shit because it doesn't look like the garbage they were trained on, and to do it in the most passive-aggressive manner possible. I love the future.

Oh, the machines won't stop at just your code.

https://github-roast.pages.dev/

I would love to hear about that code eval system lol. What's it called?

(I'm a bit grumpy because I spent the day integrating with an LLM-based auto code eval system that represents scores with emojis.)

Gonna need to hear more about that.

Internal system that is actually pretty solid (in the sense that ✅🤡💩 etc. carry a pretty meaningful signal). Why the emojis then? My suspicion is that the team who trained it doesn't want it being used the way we want to use it (or, rather, the way our PM and our users want it to be used), which is the only thing that makes sense to me.

The overarching issue is wanting to shove LLMs into everything. Tokens corresponding to emojis aren't really any worse than tokens corresponding to floats, but people would take the float tokens and treat them much more seriously than they warrant.

Yeah, obviously people make mistakes, and I'm somewhat in jest about how stupid these mistakes are (but seriously, no bounds checking, no integration testing). The obvious one, though, is the lack of staged rollouts, or of a single integration test called "send the patch to a local machine and see what happens". I like to call that a "stage 0 rollout".

It's also just funny that the bug is the kind of thing where, once you see a crash report, you spot it instantly. It's also a pretty easy bug that many third-year programming students would have been able to avoid, IMO.