Surviving and thriving through the 2022-11-05 meltdown

Background

After Elon Musk bought Twitter and started making a bizarre series of decisions about how to run it, people started logging into Mastodon to see what it was all about. Lots of them. So, so many of them. In real numbers, Free Radical grew by 20% in the last week. Which is awesome, because it’s wonderful to see new faces excited and eager to join the fun. The downside is that new users tend to be understandably excited and exploratory, with lots of posting, following other new people, and doing the kinds of things that require server hardware to wake up and earn its living. I don’t have hard stats to back it up, but from eyeballing the logs, I estimate that the server load was about 4 times what it had been 2 weeks earlier.

Rumblings

I woke up yesterday, wondered what was happening online, and saw a few messages asking why people weren’t seeing the things they expected to see, like new toots their friends had made on other servers, notifications on their mobile apps, and the like. Huh. That’s interesting, and a little alarming. The site indeed “felt” slow. Then I noticed a stream of other admins asking questions like “hey, is anyone else seeing their server catching on fire?”

Uh-oh.

Time travel

Mastodon runs a lot of little work tasks in the background, such as “add user A’s new toot to the database”, or “notify server B that user C replied to someone there”, or “let user D’s phone know that there’s a new notification for them”. As these tasks come in, they’re added to a “queue” of work to be done, and a program comes along to act on each of those tasks. Ideally, and in normal operation, tasks are completed as quickly as they’re being added to the queue, and the site feels like it’s operating as soon as a user asks it to do something. I don’t know if there’s another word for it, but I describe it as “realtime”. When I checked the queue yesterday, it was running about 45 minutes behind realtime. Every action a user took had to wait for nearly an hour before its effects were visible. That’s not good.

Old architecture

Let’s take a moment to talk about how Free Radical was set up. A Mastodon server has a few components:

A PostgreSQL database server.
A Redis caching server that holds a lot of working information in fast RAM.
A “streaming” service (running in Node.js) that serves long-running HTTP and WebSocket connections to clients.
The Mastodon website, a Ruby on Rails app, for people using the service in their web browser.
“Sidekiq”, a Ruby service that loads the same Rails app and processes the background housekeeping tasks we talked about earlier.
Amazon’s S3 storage service that handles images, videos, and all those other shiny things.

When I first launched the service in 2017, all of these services ran on the same DigitalOcean server with 4GB of RAM. I ran out of disk space pretty quickly because all of those delightful cat pictures people post take up a lot of hard drive space, so I offloaded that to S3. The PostgreSQL database also grew rapidly, and I relocated that to a server running in my own house. (That has nice privacy implications, too. US courts have ruled that it requires more effort for law enforcement to subpoena data stored in your residence than in a cloud server.) We ran that way with minor occasional adjustments for a couple of years:

PostgreSQL hosted in my house.
Media in S3.
Everything else running on the 4GB cloud server.

More about Sidekiq

When the queue is lagging behind realtime, whatever the root cause, the result is that Sidekiq isn’t working fast enough. The default Mastodon settings tell Sidekiq to use 5 worker threads, meaning that it can process 5 queued tasks at the same time. I turned that knob as Free Radical grew over the years, and had settled on having 25 worker threads. That is, it could handle queued tasks about 5 times as quickly as an untuned Mastodon instance. That worked well for years. Sometimes the instance would get flooded with a short burst of traffic, but those busy little workers would chew their way through the queue and most users would probably never notice that it was temporarily slow.
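
For anyone wanting to turn the same knob, here is a rough sketch of one way to do it in a Docker Compose setup. It is an assumption about the mechanism, not necessarily how Free Radical is configured: Sidekiq's -c flag sets the thread count, and Mastodon's DB_POOL environment variable sizes the database connection pool, which should be at least as large as the thread count. The service layout is assumed from Mastodon's stock docker-compose.yml.

```yaml
# Sketch only: assumes a stock-style Mastodon docker-compose.yml.
services:
  sidekiq:
    image: tootsuite/mastodon
    restart: always
    env_file: .env.production
    environment:
      # Each worker thread may hold a database connection, so the pool
      # should be at least as large as the thread count below.
      - DB_POOL=25
    # -c sets how many worker threads this Sidekiq process runs.
    command: bundle exec sidekiq -c 25
```

Keeping the two numbers in step matters because more threads means more simultaneous database connections, which is exactly the kind of limit that shows up later under load.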

I’ll melt with you

I noticed something worrisome when I looked at the 45-minute-old tasks: many of them were second (or third or fourth) attempts to interact with other servers. That’s unusual in normal operation. Sure, there are often a couple of servers temporarily down for service, but it’s uncommon to see many of them at once. And wow, there sure were many of them yesterday.

I have an unproven hypothesis. Suppose that Free Radical had several worker threads trying to contact instance Foo. Foo was running slowly, so those connections eventually timed out after many seconds, and Free Radical added a retry task to the end of the queue. However, while it was waiting for those connections to give up, it was responding to other servers slowly. Somewhere out there, server Bar was trying to deliver messages to Free Radical, and those connections were timing out because Free Radical was stuck waiting for Foo. That made Bar run slowly. Meanwhile, Foo was trying to contact Bar, but couldn’t because Bar was so loaded up. In other words, lots of servers were running slowly because they were waiting on all their neighbors to start running quickly.

Again, I can’t prove this. It would explain the traffic and queue patterns we saw yesterday, though, and I’d bet that a variation of this was happening.

Back to Sidekiq

Sidekiq is written in Ruby, and in Mastodon it runs the same Ruby on Rails code as the website. That means that there are lots of people who understand it and can contribute to developing and improving it. That’s good. It also means that it’s kind of a slow-running resource hog. That’s not good. Other server software is written in languages much better suited for running many background processes at once. For example, Pleroma is written in Elixir, and Elixir is all like “oh, you want me to do 473,000 things at once? OK!”

Ruby on Rails isn’t Elixir. It’s not easy to just turn up the number of worker threads and go back to eating breakfast. That didn’t stop me from trying. And in any case, I had to run more threads somehow if we ever wanted to get back to realtime. These things happened quickly:

I increased the number of worker threads.

Since each one of them insists on connecting to the database at all times, the PgBouncer connection pooler ran out of available connections.

I increased the number of PgBouncer’s allowed connections.

Now we had lots of running threads, but the server was almost out of RAM.

We needed to get rid of something.

Moving Redis

Remember that bit about Redis caching things in RAM? That’s good under normal circumstances, but now Redis and Sidekiq were fighting over RAM. And that database server was just sitting there like a slacker running PostgreSQL and sipping espresso like a smug hipster. I launched a Redis service on that hardware, configured an encrypted tunnel for it, and told Sidekiq to use the new Redis server. Then I crossed my fingers, restarted Sidekiq… and it worked! The extra RAM let the worker threads start zooming along.

However, suddenly all of my Mastodon timelines were empty. Oh, rats. Turns out they’re all cached in Redis, and when I switched Sidekiq to the new server, it lost track of the old cached data. Mastodon conveniently has a command (tootctl feeds build) to recreate all that data. I ran that command, it started working, and then the queue started filling up again faster than the workers could clear it. That’s the opposite of what I was working for.

Raspberry jammin'

Out of the corner of my eye, I saw my little lonely Raspberry Pi 4, sad because it was waiting to be picked last at recess. Hey there, little buddy! Are you up to running Sidekiq? Yes, yes it was. Now, building the Sidekiq Docker image wasn’t a quick process. Docker running on a Raspberry Pi, using NFS for storage because the RPi’s own SD card is too slow and fragile, is about as sluggish as you might think. But it worked! And once the service launched, it wouldn’t be using much drive I/O anyway, and the RPi’s CPU is surprisingly capable.

I configured Sidekiq to point at the existing database server and the new Redis service, fired it up, refreshed the Sidekiq web UI, and saw that a huge new flood of fast workers was online and tearing through tasks like my dog goes through dropped potato chips. That… worked?! Yeah, it worked!

A short while later, we were back to realtime. I ran the timeline rebuilding command again, and the worker threads temporarily got as far back as 5 minutes behind realtime, but then caught back up and stayed there. We were back in business.

Where we are now

I feel like Free Radical turned a corner in this exercise. Until yesterday, aside from the database server, Free Radical was tightly bound to a single cloud server. Now we have:

PostgreSQL and Redis running on a large, fast server.
Media in S3.
The web and streaming services, and 1 Sidekiq service, running on a 4GB cloud server.
Another Sidekiq service sharing the work equally from a separate hunk of hardware.

I could move back to the old architecture now that the short-term burst of traffic is likely over, but why? Free Radical is in a great place, where the most resource-intensive part of the system can be horizontally scaled to a cluster of additional servers on a moment’s notice without reconfiguring or restarting anything. Even without that, we now have 160 workers instead of the previous 25.

The process was a little hectic, but I sure like where we ended up.

Technical details

As an update, the above went a long way toward bringing Free Radical back. A couple of days later I noticed that the queue was still sometimes filling up faster than expected, and that the worker process on each server was pegged at 100% CPU. After some research, it seemed much better to run multiple Sidekiq processes, each with fewer worker threads, to take advantage of multi-CPU servers, since a single Ruby process mostly runs its threads on one core at a time. Here’s how I enabled that.

First, I created a much simpler config/sidekiq-helper.yml file.

This is similar to the full config/sidekiq.yml, but without the scheduler queue (because you should only have one scheduler queue worker running, ever) or any of its related scheduled jobs.
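
As a rough illustration, a helper config matching that description might look like the sketch below. It is not the exact file: the queue names and weights are assumed from Mastodon's stock config/sidekiq.yml of that era, and the concurrency value is a placeholder to tune per server.

```yaml
# config/sidekiq-helper.yml (sketch under the assumptions above)
:concurrency: 5        # placeholder; fewer threads per process, more processes
:queues:
  - [default, 8]
  - [push, 6]
  - [ingress, 4]
  - [mailers, 2]
  - [pull, 1]
# Deliberately absent: the scheduler queue and the :scheduler: / :schedule:
# sections, so this process never takes on scheduled jobs.
```

The weights only control how often each queue is checked relative to the others, so dropping scheduler leaves the relative priority of the remaining queues unchanged.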

Next, I updated my docker-compose.yml on the main server to have multiple sidekiq blocks, named sidekiq_1 and sidekiq_2.

sidekiq_2’s command line specifies the new sidekiq-helper.yml file mentioned above. With this setup, sidekiq_1 runs all of the queues, including scheduler, while sidekiq_2 runs all of them except scheduler.
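
Sketched as a docker-compose.yml fragment, that arrangement might look roughly like this. Only the service names, the helper config path, and the use of cpu_shares come from this post; the remaining keys are assumed from Mastodon's stock docker-compose.yml, the cpu_shares value of 512 is a made-up example, and keys such as networks and volumes are left out for brevity.

```yaml
# Sketch only: adapt to your own docker-compose.yml.
services:
  sidekiq_1:
    build: .
    image: tootsuite/mastodon
    restart: always
    env_file: .env.production
    # No -C flag, so Sidekiq reads the default config/sidekiq.yml,
    # which includes the scheduler queue.
    command: bundle exec sidekiq
    cpu_shares: 512   # assumed value, below the default of 1024

  sidekiq_2:
    build: .
    image: tootsuite/mastodon
    restart: always
    env_file: .env.production
    # -C points Sidekiq at the helper config, which omits scheduler.
    command: bundle exec sidekiq -C config/sidekiq-helper.yml
    cpu_shares: 512   # assumed value, below the default of 1024
```

Adding a sidekiq_3 or sidekiq_4 would be a matter of copying the sidekiq_2 block, since every extra process should use the helper config.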

On the Raspberry Pi, I created sidekiq_1 through sidekiq_4, each using the sidekiq-helper.yml config so that none of them were running the scheduler queue.

Each sidekiq block also has a cpu_shares setting. Docker Compose uses that to weight each container’s share of CPU time relative to the other containers when they compete for it. The default value is 1024, so giving the worker processes a lower value runs them at a lower CPU priority than the web and streaming containers, which helps keep the web interface and mobile apps nicely responsive.

Finally, I ran docker compose build to create new Docker images incorporating the config/sidekiq-helper.yml files and restarted the services on each server.

With these changes in place, the worker threads are no longer CPU-bound and are completing queued tasks faster than ever.