I think anyone who's been watching this switchover has noted it hasn't been the smoothest. I'm still kinda decompressing from that and I figured I'd write up why, just so you could all marvel at the ridiculous chain of catastrophes.
So.
We get the site up. People register their accounts. People start almost immediately reporting 429 errors when registering.
429 Too Many Requests is an error that means a user has done too much stuff lately, commonly known as "rate limiting". A lot of the site is rate limited, but it should be rate limited well above what an actual human will do. For example, the account creation is rate-limited at 10 per day per person; if you need more than ten accounts every day then uh maybe you're not behaving quite like we want.
Of course, people weren't making ten accounts per person; rate limiting was broken.
We looked into the rate limiting code. Rdrama runs on a service called Cloudflare, which relays connections and does a bunch of fancy caching and performance optimization and also doesn't provide service if you're farming kiwis. An annoying thing about this kind of a service is that it makes it a little trickier to figure out "who" someone is; Cloudflare includes that information on requests, but it's not in the normal place. The rate limiting code was using the Cloudflare-specific IP info. Problem: We're not on Cloudflare. So that info was just wrong. I took out the Cloudflare-specific stuff and the problem did not get fixed in any way.
Well, Cloudflare does all this fancy optimization (it's called "reverse proxying", please don't ask why), but actually, so do we. The Motte runs on the same server setup as The Vault, and The Vault is specifically designed to be extremely cacheable. We've got our own little similar frontend server doing something identical, and all connections, including Motte connections, go through it. This means we needed to get the IP from our own reverse proxy, using a different technique, which we did, and which also entirely failed to fix the issue.
At this point I tried to disable the rate limiter entirely. The rate limiter refused to disable. We'll get back to this one.
The reason, I guessed, the reverse-proxy IP didn't work is that our reverse proxy is actually behind another reverse proxy. It's reverse proxies all the way down. You may not like it, but this is what peak web development looks like. Anyway, we were getting one layer further up, but we needed to be another layer further up. The hosting service I use does in fact have a switch for enabling this; it's called Proxy Protocol. I turned Proxy Protocol on and the entire site instantly went down. So I flipped it back and the site came back up. Then I did this a few more times just to be sure it wasn't a coincidence. It wasn't.
It turns out that the reverse proxy run by me requires some very specific configuration settings to be compatible with the Proxy Protocol setting. The problem is that I'm running this proxy in sort of a weird way. Most people using this server architecture have, like, an entire devops team. I don't! It's just me. And I don't really know what I'm doing. So cue half an hour of occasional outages as I try something new. It is worth noting that some of the changes I made also broke the site, but I was suspicious that the two changes had to be made together to work at all, so sometimes I'd break the site, then I'd break the site in another way, then I'd sit there for a minute hoping it worked, and it wouldn't, and then I'd revert both changes.
Finally I figured out the magic incantation! The site worked, we got IPs, the rate limiting was functional. The 429 error was forever vanquished! I looked at the site, and checked the perf charts, and noted that we were capping the CPU on the absolute-bottom-barrel server I'd chosen, so I figured, hey, I tried moving servers before as part of a test, this should be fine, let's just fork over an extra $12/mo and boost the server a bunch, and I did this, and the site broke entirely.
I spent another thirty minutes trying to fix it; if anyone noticed the site being entirely down for a while, well, that was me trying to untangle what was wrong. I tried connecting directly to the site from its own computer; it didn't work. I spent twenty minutes analyzing this and eventually realized I was just doing it wrong. Worked fine once I did it wrong. I eventually decided this was a routing issue and had a deep suspicion.
See, Proxy Protocol was set using a switch on the hosting provider's GUI. But that's sketchy as hell - why is it a manual switch? I went back and checked and sure enough it had gotten turned off. So I turned it back on.
Site back up and running.
As near as I can tell, there is a switch on the GUI. But this switch is also overridden by some settings in my configuration. Importantly, it's overridden irregularly; sometimes you'll do something, and it'll say "oh shucks, gotta go check that switch!" Because I hadn't realized this, it went and checked it and dutifully turned it off again.
I think I've fixed that now.
So, what was the deal with rate limiting not turning off?
If you use Kubernetes to run a process, and you tell it you want the latest version of a Docker image, it will download that latest version every time you restart the process.
If you tell it you want a specific labeled version, then it won't. It'll just use whatever it has, even if the label has changed.
So if you changed from "latest" to "dev" and "main" . . . then things just don't update when you think they will, and this change happens silently unless you're aware of what Kubernetes is about to do.
I think I've fixed that now too.
I bet this new server makes things faster, doesn't it?
Nope.
Turned out the CPU usage wasn't even coming from The Motte. It was an Archive Warrior I was running on that just to soak up some extra bandwidth. Apparently it's just stupidly CPU-hungry?
I think I've fixed that also.
And that was my day, more or less.
How's your day going?
(Extra thanks to the various people who were helping out on Discord, incidentally, especially Snakes who fixed a whole bunch of not-quite-as-critical-but-still-pretty-dang-important stuff while I was fighting with the servers.)
(Edit: I forgot to mention that I also spent a few hours trying to unclog an HVAC drain line so it wouldn't flood the house. That doesn't even feel like the same day anymore.)
Jump in the discussion.
No email address required.
Notes -
Thank you for all the effort you have put in and continue to put in! You're the best.
More options
Context Copy link