Friday Fun Thread for November 1, 2024

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), this thread is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

In cranky-neckbeard-era UNIX-based distributed system environments I almost never hit a problem that I can't diagnose fairly quickly. The pieces are manageable and I can see into all of them, as long as the systems are organized in a fairly simple way. Maybe once or twice in 20 years I've been genuinely stumped by something that took weeks of debugging to get to the bottom of (both times it was a networked filesystem).
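To make the "I can see into all of them" point concrete, here's a minimal sketch (Linux-specific, standard library only) of how much of a process's state is just sitting there in /proc waiting to be read:

```python
# Minimal sketch: on Linux, a process's state is plain files under /proc,
# readable with nothing but the standard library. (You need to own the
# process, or be root, to read its fd table.)
import os
import sys

def inspect_pid(pid: int) -> None:
    proc = f"/proc/{pid}"
    # Process name, run state, thread count, resident memory.
    with open(f"{proc}/status") as f:
        for line in f:
            if line.split(":", 1)[0] in ("Name", "State", "Threads", "VmRSS"):
                print(line.rstrip())
    # Every open file, socket, and pipe the process currently holds.
    fd_dir = f"{proc}/fd"
    for fd in sorted(os.listdir(fd_dir), key=int):
        try:
            print(f"fd {fd} -> {os.readlink(os.path.join(fd_dir, fd))}")
        except OSError:
            pass  # fd was closed between listing and readlink

if __name__ == "__main__":
    inspect_pid(int(sys.argv[1]) if len(sys.argv) > 1 else os.getpid())
```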

With cloud-based garbage, being stumped or confused about unintended behavior is more the norm. Especially on GCP. I am frequently stuck trying to make sense of some obscure error, with limited visibility into the Google thing that's broken. The stack of crap is so deep that it's very time consuming to get through it all, and we often just give up and try to come up with some hacky workaround, or live with not being able to cache something the way we want, or weaken security in a way we don't want to. It's just ugly through and through and everyone has learned helplessness about it.

I almost never hit a problem that I can't diagnose fairly quickly

There can be only two reasons for that, based on my experience: either you are an extreme, generational-quality genius, a proper Einstein of bug triage, or you've just gotten lucky so far. In the former case, good for you, but again, that works only as long as the number of problems to diagnose is substantially less than what one person can handle. Even if you take 1 minute to diagnose any problem, no matter how hard it is, there are still only 1,440 minutes in a day, and I presume you also have to eat, sleep and go to the can. Consequently, a bigger system will have to fall into the hands of people who, unlike you, aren't Einsteins. And if the system is built in a way that requires an Einstein to handle it, the system is now under catastrophic risk. It could be that the system you're dealing with right now is not the kind of system where you foresee any problem you couldn't handle in a minute. That's fine - in that case, keep doing what you're doing, it works for you, no reason to change. I am just pointing out that not all systems are like that, and I have worked many times with systems that would be completely impossible to handle in "lone genius" mode. They are, in fact, quite common.
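The back-of-the-envelope version of that argument, with the numbers below being illustrative assumptions rather than data about any real system:

```python
# Back-of-the-envelope: even an implausibly fast debugger has a hard daily
# ceiling, so past some system size you must rely on non-Einsteins.
# All numbers below are made-up illustrative assumptions.
working_minutes_per_day = 8 * 60          # a working day, not the full 1,440
minutes_per_diagnosis_genius = 1          # the hypothetical Einstein
minutes_per_diagnosis_normal = 60         # a competent non-genius

genius_capacity = working_minutes_per_day // minutes_per_diagnosis_genius  # 480
normal_capacity = working_minutes_per_day // minutes_per_diagnosis_normal  # 8

incidents_per_day = 2000                  # a big fleet's worth of weird problems
print(f"one genius can cover {genius_capacity} problems/day;")
print(f"the fleet needs {-(-incidents_per_day // normal_capacity)} ordinary engineers")
```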

There can be only two reasons for that, based on my experience: either you are an extreme, generational-quality genius, a proper Einstein of bug triage, or you've just gotten lucky so far.

I just know UNIX really well. It's not a freak accident. I used to go to bed reading UNIX programming manuals when I was a teenager. I know it at a fairly fundamental level. But it's also an open platform, and there have been a lot of forks, so there's been some natural selection on what runs today (not that it's all awesome everywhere).

I can't say the same about cloud platforms at all. They're purposefully atomized to a much larger extent, you can't see into them, and there are no wisdom-of-the-ancients textbooks that take you through the source code. The API is all you have, and the documentation usually sucks. Sometimes the only way I can figure some of the APIs out is by searching GitHub for ~hours to see if someone else has done this before, if I'm lucky.
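For what it's worth, that search loop is scriptable. A rough sketch against GitHub's code-search endpoint; the query string is a made-up placeholder, and code search only works with an authenticated token:

```python
# Sketch of the "search GitHub and hope somebody has hit this before" loop.
# The query below is a placeholder; substitute the API symbol or error string
# you're actually chasing. Requires a GitHub token (code search is auth-only)
# in the GITHUB_TOKEN environment variable.
import json
import os
import urllib.parse
import urllib.request

def search_github_code(query: str, token: str) -> None:
    url = "https://api.github.com/search/code?" + urllib.parse.urlencode(
        {"q": query, "per_page": 10}
    )
    req = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    with urllib.request.urlopen(req) as resp:
        for item in json.load(resp)["items"]:
            print(item["repository"]["full_name"], item["path"])

if __name__ == "__main__":
    search_github_code('"SomeObscureClientOption" language:python',
                       os.environ["GITHUB_TOKEN"])
```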

Consequently, a bigger system will have to fall into the hands of people who, unlike you, aren't Einsteins. And if the system is built in a way that requires an Einstein to handle it, the system is now under catastrophic risk.

None of what I'm arguing for really requires being the lone genius, but I recognize trying to hire teams of people with this kind of knowledge is probably a risk.

Whatever not my problem crank crank crank

Certainly I’ve found that diagnosing problems in Azure-based CI is an absolute nightmare because you can’t just go in and fiddle with stuff. You have to reupload your pipeline, wait the obligatory 40 minutes for it to rebuild in a pristine Docker container, then hope that the print statements you added are enough to diagnose the problem, which they never are.

That said, it was still better than our previous non-cloud CI because it didn’t fail if you had more PRs than PCs or if you got shunted onto the one server that was slow and made all your perfectly functional tests time out. So I can’t condemn it wholeheartedly.
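The workaround that usually takes the edge off that loop is pulling the same image the pipeline uses and running the failing step locally, so each iteration doesn't cost a rebuild. A rough sketch, with the image name and test command as placeholder assumptions:

```python
# Sketch: run the failing CI step locally in the same Docker image the pipeline
# uses, so iterating doesn't cost a 40-minute pipeline rebuild.
# IMAGE and STEP are placeholders; substitute whatever your pipeline actually runs.
import subprocess
from pathlib import Path

IMAGE = "myregistry.example.com/ci-base:latest"   # hypothetical CI base image
STEP = ["pytest", "-x", "tests/"]                 # hypothetical failing step

subprocess.run(["docker", "pull", IMAGE], check=True)
subprocess.run(
    ["docker", "run", "--rm",
     "-v", f"{Path.cwd()}:/workspace",   # mount the checkout so edits show up instantly
     "-w", "/workspace",
     IMAGE, *STEP],
    check=True,
)
```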