At 2:53AM Pacific, our normally rock-solid Internet connection went down for 10 minutes. When it came back up at 3:03AM, that should’ve been the end of it. It wasn’t.

Short version: I had to reboot the firewall, then everything was fine.

Longer version:

  • Wake up at 7-something.
  • Yawn, stretch, pet the dog, look to see what happened online overnight.
  • Dang it, FRZ’s down.
  • Check email. See an email from a moderator from an hour earlier: hey, the site’s down!
  • Run downstairs. Instantly remember and regret that the coffee pot died yesterday and I have no caffeine, nor will I any time soon. Curse quietly.
  • SSH into FRZ. All looks OK from there, except that pgbouncer can’t connect to the database server.
  • Check the database server; it’s up, running, and twiddling its thumbs in boredom. That’s weird.
  • Use netcat from the FRZ server to verify that I can connect to the DB server. I can, but only with IPv6. IPv4 isn’t working. For ancient reasons, I had pgbouncer pinned to IPv4. Huh.
  • Speculation here:
    • I think that when the outage was over, the firewall found itself bombarded with frantic inbound connections from the FRZ server, and either temporarily blocked them or overloaded some kernel table or such.
    • There’s no easily visible evidence that either of those happened.
      • If it autoblocked the FRZ server – and it shouldn’t have, but here we are talking about it – it didn’t log it or notify me about it.
      • If it was because a NAT table filled up or such, I didn’t get an alert on that, either.
  • Outbound IPv4 was just fine. I regret now that I didn’t check other inbound IPv4 ports on the firewall. I blame the lack of caffeine.
  • Lacking anything else to go on, I rebooted the firewall, and ta-da!, we’re back.
  • The Sidekiq queue has about 36,000-and-growing tasks to chew through. It’ll be a little while until we’re 100% back up to speed.
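
For next time, the netcat step above (can I reach the DB server over IPv4? over IPv6?) could be scripted so it's a single command at 7-something in the morning. This is only a rough sketch of that check, not what I actually ran: the function names `can_connect` and `check_both` are made up for illustration, and the host and port you pass in are whatever your own database server uses.

```python
import socket


def can_connect(host, port, family, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds over one address family.

    `family` is socket.AF_INET (IPv4) or socket.AF_INET6 (IPv6).
    """
    try:
        # Resolve the host for this address family only.
        infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
    except socket.gaierror:
        # No address of this family exists for the host.
        return False

    for fam, socktype, proto, _canonname, sockaddr in infos:
        try:
            with socket.socket(fam, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return True
        except OSError:
            # Refused, timed out, unreachable, etc. Try the next address.
            continue
    return False


def check_both(host, port, timeout=3.0):
    """Check IPv4 and IPv6 reachability separately and report both."""
    return {
        "ipv4": can_connect(host, port, socket.AF_INET, timeout),
        "ipv6": can_connect(host, port, socket.AF_INET6, timeout),
    }
```

Calling something like `check_both("db.example.internal", 5432)` (a hypothetical hostname, and 5432 is only an assumption about the database port) would return a dict with one True/False per protocol. A result where `ipv6` is True and `ipv4` is False would match what I saw with netcat. It says nothing about why one family is blocked; it only makes the symptom quick to spot.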

That was weird. I don’t know why that happened, and I don’t like that feeling. And as I write this, I remember that I’d pinned pgbouncer to IPv4 because one time IPv6 stopped working in a very similar way. Maybe the same thing happened then but in reverse?