Friday Fun Thread for November 1, 2024

Be advised: this thread is not for serious in-depth discussion of weighty topics (we have a link for that), and it is not for anything Culture War related. This thread is for Fun. You got jokes? Share 'em. You got silly questions? Ask 'em.

Usually when I run into someone who pooh-poohs those tools, they're the sort of person who wants to write their own epic genius 1337 codegolf in-house tool that has zero documentation, is full of idiosyncrasies, and will become someone else's pain in the ass when they leave the company in a year.

To use a toy example, discussing one aspect: let's say you have an app that needs to be up all of the time. A simple solution is to set up the app on one box and a standby on the next box. If the primary goes down, you simply respond, assess, and confirm that yes, the primary is down. Let's start the standby.
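
To make it concrete, here's a rough sketch of the kind of runbook script I have in mind. The hostnames, health endpoint, and systemd unit are all made up for illustration; the point is just that a human confirms the failure before anything moves.

```python
#!/usr/bin/env python3
"""Manual failover helper: confirm the primary is really down, then start the standby.

Hostnames, port, health endpoint, and service name below are placeholders.
"""
import subprocess
import sys
import urllib.request

PRIMARY_HEALTH_URL = "http://primary.internal:8080/healthz"   # hypothetical
STANDBY_HOST = "standby.internal"                              # hypothetical
START_STANDBY = ["ssh", STANDBY_HOST, "sudo", "systemctl", "start", "myapp.service"]

def primary_is_up(timeout: float = 5.0) -> bool:
    """Return True if the primary answers its health check with a 200."""
    try:
        with urllib.request.urlopen(PRIMARY_HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    if primary_is_up():
        print("Primary looks healthy; not failing over.")
        sys.exit(0)
    answer = input("Primary appears down. Start the standby? [y/N] ")
    if answer.strip().lower() == "y":
        subprocess.run(START_STANDBY, check=True)
        print("Standby started. Remember to repoint DNS or the load balancer at it.")
```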

People absolutely cannot resist looking at this setup and saying, well, why do that when you can have the standby take over automatically? And yes, I get it, that's a reasonable desire. And yes, you can do that, but that model is so much more complicated and difficult to get right. There are frameworks now that help with this, but the cost of setting up an app is still an order of magnitude higher if you want this kind of automation.

Unfortunately, the modern distributed computing environment is organized around the idea that everything needs to be as highly available as Google Search or YouTube. This is the default way of standing up an app.

Maybe your business has one big bread-and-butter app that needs this, and by all means, go ahead, but businesses also have something like 100x as many apps that are just support or bean-counting tools, apps that absolutely don't need this, and you get pulled into setting them up the same way. It becomes so difficult to set up an app that teams lose the vocabulary for even proposing one as a solution to small problems.

Definitely agree. One of the more challenging parts of my job is having to be the guy who says, "Okay, you want this app to be HA... but why? If you can justify this to me and tie it to some positive business outcome that merits the extra engineering hours spent, we can do this. Otherwise, no." I've only ever worked on understaffed teams, so I've always had to be extremely judicious when allocating engineering effort. Most ICs want to do this kind of thing because it's cool, or "best practice," or they see it as a career builder/educational opportunity. (FWIW, in 1:1s I ask what their career growth goals are and actively try to match them with work that will help them progress, so I'm not entirely unsympathetic to their wishes.)

It also just seems a lot easier than it really is. There's the whole Aphyr/Jepsen series where he puts a bunch of distributed databases that everyone knows are supposed to be good and correct to the test, and they fall apart miserably. Almost every single one. It's bad enough that people don't really understand the CAP theorem's tradeoffs, but the real-world systems are even worse, because they can't even live up to what they claim to guarantee.

If you really think your application has outgrown the directory tree of .json files or the SQLite instance, show me how you plan to deal with failures and data consistency issues. It's not trivial, and if you think it is, I'm not going to let you drive.

or they see it as a career builder/educational opportunity

I feel like this is the unstated rationale for using every single cloud provider's API

A simple solution is to set up the app on one box and a standby on the next box. If the primary goes down, you simply respond, assess, and confirm that yes, the primary is down. Let's start the standby.

Then the standby goes down, or doesn't start. Your next move? You start debugging shit while people around you run around with their hair on fire and scream bloody murder at you: the system has been down for over 2 kiloseconds and you still haven't fixed it, are you actually sent from the future to ruin us all?

And note that this will definitely happen at 3am, when you are down with the flu, your internet provider is hit by a freak storm, and your dog ate something and vomited over every carpet in your house. That's always how it happens. It never goes the optimistic way. And then you realize it'd be cool if you had some tools that can help you in these situations, even if it means paying a little extra.

the bulk of my experience is in quant trading, where every minute we were down cost tens of thousands. we actually did engineer a lot of systems the way I described, just because they were so much easier to reason about and took much less effort to stand up and maintain

They are easier to reason about up to a point. Which a typical HFT trading setup will probably never cross, but a lot of other companies frequently do.

yes, and if we reach that point we will introduce the complex multi-master thing

but most things never reached that point

People absolutely cannot resist looking at this setup and saying, well, why do that when you can have the standby take over automatically? And yes, I get it, that's a reasonable desire. And yes, you can do that, but that model is so much more complicated and difficult to get right.

Who's going to get paged awake at 3AM on Saturday to run a shell script to fail over to the standby? I presume there are some services out there where two or three days of downtime is fine, but I don't have any experience with them.

In contrast, I find it's pretty easy to set up a service with a few replicas and a load balancer with health checking in front of it so that nobody needs to be manually selecting replicas. It's not complicated, and with reasonable infrastructure it's a problem you solve once and reuse everywhere, in contrast to hand-rolling another unreliable service that's going to become somebody's operational burden.
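
To sketch what I mean (a real setup would just use nginx, HAProxy, or a cloud load balancer; the replica addresses and the /healthz path here are placeholders), the health-checking logic the load balancer runs for you is roughly this:

```python
"""Toy illustration of load-balancer-style health checking over a few replicas.

Replica addresses and the /healthz path are placeholders; in practice this loop
lives inside the load balancer, not in your application code.
"""
import itertools
import urllib.request

REPLICAS = [
    "http://app-1.internal:8080",   # hypothetical replica addresses
    "http://app-2.internal:8080",
    "http://app-3.internal:8080",
]

_round_robin = itertools.count()

def healthy_replicas(timeout: float = 2.0) -> list[str]:
    """Probe each replica's health endpoint and keep the ones that answer 200."""
    alive = []
    for base in REPLICAS:
        try:
            with urllib.request.urlopen(f"{base}/healthz", timeout=timeout) as resp:
                if resp.status == 200:
                    alive.append(base)
        except OSError:
            pass  # unreachable or erroring replicas simply drop out of the pool
    return alive

def pick_backend() -> str:
    """Round-robin over whatever is currently healthy; no human in the loop."""
    pool = healthy_replicas()
    if not pool:
        raise RuntimeError("no healthy replicas")
    return pool[next(_round_robin) % len(pool)]
```

Nobody gets paged just because one replica died; the pool shrinks and life goes on.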

Put another way, being a pager monkey for one unreliable service is already dumb. Doing that for ten services is just ridiculous.

In contrast, I find it's pretty easy to set up a service with a few replicas and a load balancer with health checking in front of it so that nobody needs to be manually selecting replicas.

yeah, that part's easy. what if you want to make the database they write to redundant? then you have to worry about CAP issues, and that makes things much more obnoxious

Yeah, but you've presumably already had to solve that problem one way or another, because you've (I assume?) already got a service that needs a highly available database. Surely configuring replication for MySQL isn't insurmountable for a leet haxx0r such as yourself.
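
For the record, the happy path is roughly this. A sketch only, assuming MySQL 8.0.23+ with GTID replication enabled and a replica already seeded from a snapshot, using mysql-connector-python; hosts and credentials are placeholders.

```python
"""Sketch: point a MySQL 8 replica at its source and start replicating.

Assumes GTIDs are enabled on both servers and the replica already holds a
snapshot of the data. Hostnames, users, and passwords are placeholders.
"""
import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(host="replica.internal", user="admin", password="...")
cur = conn.cursor()

cur.execute("""
    CHANGE REPLICATION SOURCE TO
        SOURCE_HOST = 'primary.internal',
        SOURCE_USER = 'repl',
        SOURCE_PASSWORD = '...',
        SOURCE_AUTO_POSITION = 1
""")
cur.execute("START REPLICA")

# Sanity check: the replication I/O and SQL threads should both report running.
cur.execute("SHOW REPLICA STATUS")
print(cur.fetchall())
conn.close()
```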

no? not every system wants the same CAP tradeoffs. not everything benefits from the overhead of being a distributed system. it's not free to make something distributed.

example: github.com has real costs because it's planet scale and multi-master. it takes 250ms+ to process your push because it makes geographic redundancy guarantees that cannot physically be improved on. if you did away with the multi-master setup it would take significantly less time

you have "solved" the distributed system problem here but making every push take that much longer is a big cost. in this case, it just so happens github doesn't care about making every developer wait an extra 250ms per push

to say nothing about how you've also blown up complexity that needs to be reasoned through and maintained

(and yes, it doesn't have to be geographically redundant, I'm simply upping the scale to demonstrate tradeoffs)
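
for a rough sense of where a number like that comes from, here's a back-of-envelope sketch; the distance and the one-round-trip quorum assumption are illustrative, not anything github has published:

```python
"""Back-of-envelope: latency floor for acknowledging a write across an ocean.

The distance and the "one round trip to a remote replica" assumption are
illustrative; real systems add queuing, TLS, and disk on top of this.
"""
FIBER_KM_PER_MS = 200        # light in fiber covers roughly 200 km per millisecond
US_EAST_TO_EU_KM = 6_000     # rough great-circle distance, e.g. Virginia to Frankfurt

one_way_ms = US_EAST_TO_EU_KM / FIBER_KM_PER_MS   # ~30 ms
round_trip_ms = 2 * one_way_ms                    # ~60 ms before any real work happens

print(f"one way: {one_way_ms:.0f} ms, round trip: {round_trip_ms:.0f} ms")
```

multiply that by however many round trips your replication protocol needs before it can acknowledge, and the single-region setup starts looking pretty good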

I certainly don't call 250ms a big cost here. That is literally so small that I would never notice it.

Mmm, I notice it. If I'm working on solo projects I switch to a git repo on a personal server instead of github just to avoid it.

no?

So you've got a system where you can't pay some latency for availability (I'll level with you, 250ms is an ass-pull on your part; even planet-scale databases like Spanner that are way overkill for something like this can manage much better latencies, to say nothing of a simple MySQL master/replica situation), but it's totally fine if it goes down and stays down over a weekend?

If we're talking about a system where 24% uptime (aka business hours) is totally cool, yeah, I guess you don't need to think about reliability, but again I've never seen a system like this, so I don't know if they exist.

If we're talking about a system where uptime actually matters, it's totally unsustainable to page someone awake to do a manual failover every time the primary shits the bed. That also comes with a cost, and it's probably bigger than a few ms of latency to make sure your database is available. Making stuff run all the time is literally what computers are for.

(I'll level with you, 250ms is an ass-pull on your part; even planet-scale databases like Spanner that are way overkill for something like this can manage much better latencies

I can tell you for an absolute fact that plans to use Spanner to back geographically redundant multi-master git repos made latency even worse. But this is a digression.

(and yes, it doesn't have to be geographically redundant, I'm simply upping the scale to demonstrate tradeoffs)

I'm saying the magic distributed database introduces tradeoffs over the single SQLite file, that those tradeoffs vary by project, and that I used github.com as a mundane but easily accessible example.