We had several hours of downtime yesterday because of a “fun” PostgreSQL problem (see Beware of your next glibc upgrade for details). Short version: because there are lots of ways to sort text, and those rules can change subtly when you upgrade your OS, some of your unique database keys may end up being not so unique. This left quite a few Mastodon tables with duplicate rows despite their unique indexes. It also meant that PostgreSQL was all like “indexes? lol never heard of ’em” and query performance was pretty awful.
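To make the mechanism concrete, here’s a minimal Python sketch of locale-dependent sorting. The strings are illustrative, the en_US.UTF-8 locale has to actually be installed on your system, and the result in that locale depends on which glibc version your OS ships — which is exactly the problem.

```python
import locale

# In the C locale, strings compare byte by byte: '-' (0x2d) sorts
# before 'a' (0x61), so "a-b" < "ab".
locale.setlocale(locale.LC_COLLATE, "C")
print(locale.strcoll("a-b", "ab"))  # negative: "a-b" sorts first

# In a linguistic locale, glibc may ignore punctuation at the primary
# comparison level, and those rules differ between glibc versions. A
# btree index built on disk under the old rules is silently wrong
# under the new ones.
locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")
print(locale.strcoll("a-b", "ab"))  # result varies by glibc version
```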

Fixing this started off looking like “run REINDEX index_foo, see what the dupes are, delete them, and try again until it works”. Yeah, turns out a few tables had several thousand dupes, and I wanted to finish this before I died of old age. It was a whole lot faster to dump each table with a COPY ... TO '/tmp/foo.csv' command, run a short Python script to identify all the dupes in a single pass (sketched below), and issue one giant DELETE FROM foo WHERE id IN (1, 23, 42, 187, ...) query. (DBAs may recoil in disgust here: “why didn’t you just run a query to identify the duplicates and delete them from there?” Fair question, but remember that we had tables with 100,000,000 rows and the indexes on the columns we cared about were essentially non-functioning. After a few EXPLAIN ANALYZE ... queries that turned into a therapy session, that idea ruled itself out.)
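Here’s a minimal sketch of that single-pass script, assuming the dump was made with something like COPY foo TO '/tmp/foo.csv' WITH (FORMAT csv, HEADER) and that the broken unique index covers a single column. The names foo, id, and key are illustrative, not Mastodon’s actual schema.

```python
import csv

seen = {}       # key value -> id of the first row that had it
dupe_ids = []   # ids of every later row with the same key value

with open("/tmp/foo.csv", newline="") as f:
    for row in csv.DictReader(f):
        key = row["key"]
        if key in seen:
            dupe_ids.append(row["id"])
        else:
            seen[key] = row["id"]

# One giant DELETE beats thousands of index-free round trips.
if dupe_ids:
    print(f"DELETE FROM foo WHERE id IN ({', '.join(dupe_ids)});")
```

Keeping the first row seen for each key and deleting the rest is one reasonable policy; which copy you keep is a judgment call that depends on the table.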

After a few hours of that joy, I ran a final VACUUM FULL to clean up all the cruft, and that took another hour or so. Then and only then did I restart Mastodon, which is running really well now that it can, you know, use database indexes.

For future rounds of such joy, I’m considering a custom 502 error handler that’s themed like the rest of the site and links to this blog.