This weekly roundup thread is intended for all culture war posts. 'Culture war' is vaguely defined, but it basically means controversial issues that fall along set tribal lines. Arguments over culture war issues generate a lot of heat and little light, and few deeply entrenched people ever change their minds. This thread is for voicing opinions and analyzing the state of the discussion while trying to optimize for light over heat.
Optimistically, we think that engaging with people you disagree with is worth your time, and so is being nice! Pessimistically, there are many dynamics that can lead discussions on Culture War topics to become unproductive. There's a human tendency to divide along tribal lines, praising your ingroup and vilifying your outgroup - and if you think you find it easy to criticize your ingroup, then it may be that your outgroup is not who you think it is. Extremists with opposing positions can feed off each other, highlighting each other's worst points to justify their own angry rhetoric, which becomes in turn a new example of bad behavior for the other side to highlight.
We would like to avoid these negative dynamics. Accordingly, we ask that you do not use this thread for waging the Culture War. Examples of waging the Culture War:
- Shaming.
- Attempting to 'build consensus' or enforce ideological conformity.
- Making sweeping generalizations to vilify a group you dislike.
- Recruiting for a cause.
- Posting links that could be summarized as 'Boo outgroup!' Basically, if your content is 'Can you believe what Those People did this week?' then you should either refrain from posting, or do some very patient work to contextualize and/or steel-man the relevant viewpoint.
In general, you should argue to understand, not to win. This thread is not territory to be claimed by one group or another; indeed, the aim is to have many different viewpoints represented here. Thus, we also ask that you follow some guidelines:
- Speak plainly. Avoid sarcasm and mockery. When disagreeing with someone, state your objections explicitly.
- Be as precise and charitable as you can. Don't paraphrase unflatteringly.
- Don't imply that someone said something they did not say, even if you think it follows from what they said.
- Write like everyone is reading and you want them to be included in the discussion.
On an ad hoc basis, the mods will try to compile a list of the best posts/comments from the previous week, posted in Quality Contribution threads and archived at /r/TheThread. You may nominate a comment for this list by clicking on 'report' at the bottom of the post and typing 'Actually a quality contribution' as the report reason.
Just wanted to push against messages like this, because this sounds like something from "revenge of the nerds."
Big systems like Twitter's have accumulated multiple layers of redundancy in case of failure over the years. There's probably quite a bit of automation to take care of the steady stream of problems like faulty hard drives or network cards. It can probably keep on going for quite some time this way.
Also, the biggest source of incidents? Change.
If so many Twitter engineers have left/been fired, then I imagine the rate of changes introduced into the system is approaching the level of a code freeze--basically a ban on introducing changes to the system around the holidays, imposed to minimize risk even though it carries a very high cost.
In this state, I would expect a skeleton crew to be able to keep things running for months. Especially if you can get some really good engineers to tackle the 'black swan' type incidents that actually do require some clever thinking to fix--but again, this is all about pushing the systems back into a stable state (less risky) rather than "fixing forward" (more risky).
What I would be worried about is sabotage that can fall under plausible deniability. Stuff like setting a primary key on a database column to an int32, which will hit the limit in weeks/months and is annoyingly hard to fix. But maybe by then Musk will have a larger set of solid engineers working at Twitter.
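To put rough numbers on the int32 scenario, here is a minimal sketch. The function name and the insert rate are made up for illustration; nothing here reflects Twitter's actual schema or traffic.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer key can hold


def days_until_exhaustion(current_max_id: int, inserts_per_day: int) -> float:
    """Days left before an auto-incrementing signed int32 key runs out.

    Assumes one new key per insert and a constant insert rate, which is
    a simplification; real growth is rarely linear.
    """
    if inserts_per_day <= 0:
        raise ValueError("inserts_per_day must be positive")
    remaining = INT32_MAX - current_max_id
    return max(remaining, 0) / inserts_per_day


# Hypothetical table: starts empty and gains 20 million rows per day.
days_left = days_until_exhaustion(0, 20_000_000)
print(round(days_left, 1))
```

At that assumed rate the key space lasts a bit over three months, which is the "weeks/months" window described above. The fix is annoying because widening the column usually means rewriting the table and every foreign key that points at it.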
(1) Yes, there is a steady stream of problems addressable by automation, but those have never been a problem. SREs exist for the other problems.
Shit just falls over and you won't know why. That's just how these systems are. You can make a system that doesn't do that, but then you pay thousands of dollars per line written, which they're obviously not gonna do.
To put meat on the bones, see this list of common things SREs deal with, or this log of the SRE chatroom for Wikipedia & friends.
(2) Change is unavoidable and constant. Security patches for your dependencies are released continuously, and you will update your system or face the consequences. Oftentimes your dependency is an underfunded open-source thingy, despite your best efforts to avoid those, and thus the only way to get the new code is to use the newest version of the thingy, which means you might have to upgrade all of your code that uses the thingy.
(3) Regarding "pushing the systems back into a stable state" - then you're gonna have the same problem again unless you fix the root cause, which, again, requires code changes.
SRE is my day job :). Worked at one of these behemoths at some point, specifically deep on the infrastructure side of things.
None of these companies ever even dreamed of it. It's all about cheap hardware, multiple replicas, and the ability to reroute traffic between failure domains.
That's the thing--it's not constant. Like I mentioned earlier, companies do holiday code freezes so the rate of change decreases to a very small amount. Even security patches can be split into critical and non-critical, then those critical patches can be further split into "requires downtime" and "nothingburger."[1]
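The split described above can be sketched as a small triage function. This is only an illustration under my own assumptions: the `Patch` fields, the bucket names, and the sample patches are hypothetical, not any real tool's data model.

```python
from dataclasses import dataclass


@dataclass
class Patch:
    name: str
    critical: bool
    requires_downtime: bool


def triage(patches):
    """Sort patches into the three buckets implied by the two-level split."""
    buckets = {"defer": [], "apply_now": [], "schedule_downtime": []}
    for p in patches:
        if not p.critical:
            # Non-critical patches wait until the freeze is over.
            buckets["defer"].append(p.name)
        elif p.requires_downtime:
            # Critical and disruptive: needs a planned maintenance window.
            buckets["schedule_downtime"].append(p.name)
        else:
            # Critical but harmless to roll out: the "nothingburger" case.
            buckets["apply_now"].append(p.name)
    return buckets


# Hypothetical patches for illustration only.
queue = [
    Patch("logging-lib-minor", critical=False, requires_downtime=False),
    Patch("tls-lib-fix", critical=True, requires_downtime=False),
    Patch("kernel-fix", critical=True, requires_downtime=True),
]
print(triage(queue))
```

Only the `schedule_downtime` and `apply_now` buckets count as change during a freeze, which is why the effective rate can get so low.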
So if there's a feature freeze at twitter, then the rate of change is drastically reduced. And if people leave/get fired, that reduces the rate even further. And if you ignore all but the critical patches, then the rate begins approaching zero. That's a lot of "ifs", but all of them seem like good decisions with positive impact, also based in an accepted industry norm (code freezes), so I'm betting that management at Twitter will go down this path.
But let's wait and see! We're trying to infer what's happening inside of a black box. If reality leans toward my bet, what I'm expecting to see, over the course of the next year, is:
- multiple instances of graceful degradation: users missing avatars for a few hours; intermittent general slowness; a few instances of data loss for a small group of users.
- multiple instances of planned downtime.
- a few instances of unplanned downtime, but no longer than 1-2 days.
Now, and correct me if I'm wrong please, if reality leans toward your bet, what I would expect to see is:
- multiple instances of unplanned downtime, ranging anywhere from a few hours to days, maybe even 1-2 weeks.
- at least one prolonged outage (>4 weeks).
- almost constant degradation of service: Twitter being noticeably slower; multiple days when users can't log in; multiple instances of data loss for a large (single digit %) group of users.
Let's see what happens!
[1]: Also, you reminded me about an oft overlooked source of change: shit expiring. Certificates, but also licenses, generators, and whatnot. These are silent killers, because they're hard to track and require manual work. I'm still counting them into my "low or no change" bet--that's where I would expect to see unplanned downtime that's fixed in a couple of hours.
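One way to make the expiring stuff less silent is to keep an inventory with expiry dates and flag whatever is coming due. A minimal sketch, assuming such an inventory exists at all (which is the hard part); the item names and dates are invented.

```python
from datetime import date, timedelta


def expiring_soon(inventory, today, warn_days=30):
    """Return (name, days_left) for items expiring within warn_days.

    Items that have already expired are included with a negative
    days_left. Results are sorted soonest first.
    """
    cutoff = today + timedelta(days=warn_days)
    due = [
        (name, (expires - today).days)
        for name, expires in inventory.items()
        if expires <= cutoff
    ]
    return sorted(due, key=lambda item: item[1])


# Hypothetical inventory: certificates, licenses, and so on.
inventory = {
    "tls-cert-api": date(2023, 1, 10),
    "db-license": date(2023, 6, 1),
    "signing-key": date(2022, 12, 20),
}
print(expiring_soon(inventory, today=date(2022, 12, 15)))
# prints [('signing-key', 5), ('tls-cert-api', 26)]
```

The code is trivial; the manual work is in finding every expiring thing and keeping the inventory current, which is exactly what gets dropped when the people who knew about it leave.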