
A Writeup On The Reason The Motte Relaunch Was Rocky

I think anyone who's been watching this switchover has noted it hasn't been the smoothest. I'm still kinda decompressing from that and I figured I'd write up why, just so you could all marvel at the ridiculous chain of catastrophes.

So.

We get the site up. People register their accounts. People start almost immediately reporting 429 errors when registering.

429 Too Many Requests is an error that means a user has done too much stuff lately, commonly known as "rate limiting". A lot of the site is rate-limited, but it should be rate-limited well above what an actual human will do. For example, account creation is rate-limited at ten per day per person; if you need more than ten accounts a day then, uh, maybe you're not behaving quite like we want.
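The underlying idea is simple enough to sketch. This is not the actual codebase's limiter (the class, names, and windowing scheme here are all invented for illustration), just a minimal per-key sliding-window version of the concept:

```python
import time
from collections import defaultdict


class RateLimiter:
    """Hypothetical sketch: allow at most `limit` events per `window` seconds per key."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.events = defaultdict(list)  # key -> list of event timestamps

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        # Drop timestamps that have aged out of the window.
        cutoff = now - self.window
        recent = [t for t in self.events[key] if t > cutoff]
        self.events[key] = recent
        if len(recent) >= self.limit:
            return False  # caller should respond 429 Too Many Requests
        recent.append(now)
        return True


# Ten signups per day per key, matching the limit described above.
signup_limiter = RateLimiter(limit=10, window=86400)
```

The important part, and the part where everything below goes wrong, is what you use as the key to identify "a person" — which in practice means their IP address.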

Of course, people weren't making ten accounts per person; rate limiting was broken.

We looked into the rate limiting code. rDrama runs on a service called Cloudflare, which relays connections and does a bunch of fancy caching and performance optimization, and also doesn't provide service if you're farming kiwis. An annoying thing about this kind of service is that it makes it a little trickier to figure out "who" someone is; Cloudflare includes that information on requests, but it's not in the normal place. The rate limiting code was using the Cloudflare-specific IP info. Problem: we're not on Cloudflare. So that info was just wrong. I took out the Cloudflare-specific stuff and the problem did not get fixed in any way.
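For reference, Cloudflare passes the real client address in the `CF-Connecting-IP` request header. A hedged sketch of the lookup that kind of code effectively does (the helper itself is hypothetical):

```python
def client_ip(headers, remote_addr):
    """Resolve the client IP the way Cloudflare-aware code does.

    Cloudflare puts the real client address in CF-Connecting-IP.
    If you're *not* behind Cloudflare, the header is simply absent
    (or attacker-supplied), so you fall back to the raw socket address.
    """
    return headers.get("CF-Connecting-IP", remote_addr)
```

Without Cloudflare in front, every request falls back to the socket address — which, behind a reverse proxy, is the proxy's address, so every user lands in the same rate-limit bucket.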

Well, Cloudflare does all this fancy optimization (it's called "reverse proxying", please don't ask why), but actually, so do we. The Motte runs on the same server setup as The Vault, and The Vault is specifically designed to be extremely cacheable. We've got our own little frontend server doing much the same thing, and all connections, including Motte connections, go through it. This means we needed to get the IP from our own reverse proxy, using a different technique, which we did, and which also entirely failed to fix the issue.
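The standard mechanism here is the `X-Forwarded-For` header, which each proxy in the chain appends to. A hypothetical helper (not the real code) shows why the *number* of proxies matters so much:

```python
def ip_from_forwarded(headers, remote_addr, trusted_hops=1):
    """Recover the client IP from X-Forwarded-For.

    X-Forwarded-For is a comma-separated chain: client, proxy1, proxy2, ...
    Each proxy appends the address *it* saw. trusted_hops is how many
    proxies you control; you take the entry that many hops from the end.
    """
    chain = [p.strip() for p in headers.get("X-Forwarded-For", "").split(",") if p.strip()]
    if len(chain) >= trusted_hops:
        return chain[-trusted_hops]
    return remote_addr  # no (usable) header: fall back to the socket address
```

If there's an extra proxy in front that you didn't account for, you end up one hop short and rate-limit the inner proxy's address instead of the client's — which is exactly the situation described next.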

At this point I tried to disable the rate limiter entirely. The rate limiter refused to disable. We'll get back to this one.

The reason, I guessed, that the reverse-proxy IP didn't work is that our reverse proxy is actually behind another reverse proxy. It's reverse proxies all the way down. You may not like it, but this is what peak web development looks like. Anyway, we were getting one layer further up, but we needed to be another layer further up. The hosting service I use does in fact have a switch for enabling this; it's called Proxy Protocol. I turned Proxy Protocol on and the entire site instantly went down. So I flipped it back and the site came back up. Then I did this a few more times just to be sure it wasn't a coincidence. It wasn't.
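The instant breakage makes sense once you know how Proxy Protocol works: the load balancer prepends a small `PROXY` header to every TCP connection, and the backend has to be configured to expect it — if only one side is switched on, nothing parses. Assuming an nginx frontend (the post doesn't say which proxy is in use; this fragment and the address range are hypothetical), the matching configuration looks something like:

```nginx
server {
    # proxy_protocol here makes nginx *expect* the PROXY header on every
    # incoming connection. Flip the hosting-side switch without this (or
    # vice versa) and every request fails to parse -- instant outage.
    listen 443 ssl proxy_protocol;

    # Trust the load balancer's addresses and recover the real client IP
    # from the PROXY header instead of the socket address.
    set_real_ip_from 10.0.0.0/8;       # hypothetical load-balancer range
    real_ip_header  proxy_protocol;
}
```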

It turns out that the reverse proxy I run requires some very specific configuration settings to be compatible with the Proxy Protocol setting. The problem is that I'm running this proxy in sort of a weird way. Most people using this server architecture have, like, an entire devops team. I don't! It's just me. And I don't really know what I'm doing. So cue half an hour of occasional outages as I tried one thing after another. It's worth noting that some of the changes I made also broke the site on their own, but I suspected the two changes had to be made together to work at all; so sometimes I'd break the site, then break it in a second way, then sit there for a minute hoping it worked, and it wouldn't, and then I'd revert both changes.

Finally I figured out the magic incantation! The site worked, we got IPs, the rate limiting was functional. The 429 error was forever vanquished! I looked at the site, and checked the perf charts, and noted that we were capping the CPU on the absolute-bottom-barrel server I'd chosen, so I figured, hey, I tried moving servers before as part of a test, this should be fine, let's just fork over an extra $12/mo and boost the server a bunch, and I did this, and the site broke entirely.

I spent another thirty minutes trying to fix it; if anyone noticed the site being entirely down for a while, well, that was me trying to untangle what was wrong. I tried connecting directly to the site from its own computer; it didn't work. I spent twenty minutes analyzing this and eventually realized I was just doing it wrong. Worked fine once I did it right. I eventually decided this was a routing issue and had a deep suspicion.

See, Proxy Protocol was set using a switch on the hosting provider's GUI. But that's sketchy as hell - why is it a manual switch? I went back and checked and sure enough it had gotten turned off. So I turned it back on.

Site back up and running.

As near as I can tell, there is a switch on the GUI. But this switch is also overridden by some settings in my configuration. Importantly, it's overridden irregularly; sometimes you'll do something, and it'll say "oh shucks, gotta go check that switch!" Because I hadn't realized this, it went and checked it and dutifully turned it off again.

I think I've fixed that now.

So, what was the deal with rate limiting not turning off?

If you use Kubernetes to run a process, and you tell it you want the latest version of a Docker image, it will download that latest version every time you restart the process.

If you tell it you want a specific labeled version, then it won't. It'll just use whatever it has, even if the label has changed.

So if you change from "latest" to specific tags like "dev" and "main" . . . then things just don't update when you think they will, and the change happens silently unless you're aware of what Kubernetes is about to do.
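In pod-spec terms (hypothetical manifest, not the actual deployment): Kubernetes defaults `imagePullPolicy` to `Always` only when the tag is `latest` (or omitted); any other tag defaults to `IfNotPresent`, so changing the tag silently changes pull behavior unless you pin the policy yourself:

```yaml
spec:
  containers:
    - name: themotte                        # hypothetical container name
      image: ghcr.io/example/themotte:dev   # hypothetical image; was ":latest"
      # ":latest" implies imagePullPolicy: Always; any named tag implies
      # IfNotPresent, i.e. "use whatever's already on the node".
      # Pin it explicitly so a tag change doesn't alter update behavior:
      imagePullPolicy: Always
```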

I think I've fixed that now too.

I bet this new server makes things faster, doesn't it?

Nope.

Turned out the CPU usage wasn't even coming from The Motte. It was an Archive Warrior I was running on that server just to soak up some extra bandwidth. Apparently it's just stupidly CPU-hungry?

I think I've fixed that also.

And that was my day, more or less.

How's your day going?

(Extra thanks to the various people who were helping out on Discord, incidentally, especially Snakes who fixed a whole bunch of not-quite-as-critical-but-still-pretty-dang-important stuff while I was fighting with the servers.)

(Edit: I forgot to mention that I also spent a few hours trying to unclog an HVAC drain line so it wouldn't flood the house. That doesn't even feel like the same day anymore.)


When rDrama was new (and then not so new), we still had constant outages and hilarious glitches all the time. Like Aevann got 0 sleep for the first six months or so. Every time he’d add a feature, a million other things would break. Then as soon as he’d fix one of them, two million more would break. There was one night where we couldn’t comment or view any threads because something innocuous broke when patching something else and we all just communicated via thread titles and publicly visible reports for hours.

Now everything runs incredibly smoothly and Aevann has learned a ton just through endless trial by fire (and sleep deprivation) for those months. We add huge new things all the time and are constantly optimizing early jank with new knowledge and nothing ever really breaks for more than a couple minutes at worst anymore.

I realize this is a No Fun Allowed by design place, but it’s important - for your own sanity as the dev, and for the userbase’s tolerance of early growing pains and learning moments - to take it all in stride and have fun with it. We fostered a culture immediately of “shit’s going to break, we’re learning, deal with it” and people have always taken it in stride and memed about it endlessly because we built that culture up. No one gets mad. No one has to make tedious mea culpas because we broke something. We reward people for breaking things and encourage it because then we can fix an issue we weren’t aware of. This is a good system and lets people have fun and not freak out when things break.

I’d strongly recommend not setting an expectation for lengthy technical explanations of what happened and why when something goes wrong. A sentence or two at most. “Sorry, I was drunk and fell asleep at 4am trying to fix something else and I was too tired to fix it, that’s why you could only communicate via dick pics for 6 hours” is perfectly serviceable.

That’s one of the many nice parts about not being a massive global megacorp like Reddit. You’re just a few dudes doing something for fun. You don’t owe stakeholders receipts for something that broke. There are no stakeholders. The userbase will understand. But if you go about explaining everything that went wrong every time something goes wrong, you’ll breed mounting discontent and you’ll never have time for anything else.

Lighten up nerds.

Unironic thanks, we'd be stuck on reddit without the rdrama code, and reddit put a damper on autistically examining every aspect of our culture in the nerdiest way possible. (Trains. I mean trains) test: 🚂🚃🚃

Honestly a good reason I wrote it out is just because it was a funny clusterfuck. But yeah, the whole place is going to be unstable for a while; that's just the truth of launching a new service, especially when you don't have a full-time dev team.

In general I'm not going to be posting these unless they're specifically interesting :V