The day Claptrack lost his mind

At around 19:04 (UTC), Claptrack suffered a catastrophic malfunction that made him quite literally lose his mind. In a matter of seconds, Claptrack wiped out his entire database. Yes, the entire thing. All tracked projects, all server settings, gone. It was as if the bot were brand new and had never been used.

 

In this post we will explain the series of events that led up to this, what we learned from this experience and how we are preventing it from ever happening again.

 

What went wrong

Before we continue, we’d like to point out that both of our FDD servers run almost everything in Docker containers. This will be relevant later on.

 

Around 18:57 on July 7, 2023, we noticed a misconfiguration in one of our Docker containers, namely the Pterodactyl panel, which is responsible for running and managing our bots, Minecraft servers, mavens and a bunch of internal tools.

To fix the configuration, a Docker restart was required (think of it like a computer reboot, but without actually rebooting the machine). When this occurred, everything went smoothly at first: every container restarted and came back online like it was supposed to.

 

Now the next part is where things went wrong. Claptrack contains a “Cleanup Thread” that runs once every 24 hours, as well as whenever the bot is restarted. This thread is responsible for removing from the database the projects and data of Discord servers that no longer use the bot. For example, if you kick the bot from your server, the cleanup thread will remove everything connected to your server.

Since this feature was implemented in late May, it had been working correctly and without any issues. Until today. At the moment of the restart, the Discord API experienced a brief drop, causing it to return incorrect information when the bot requested its server list.

When the thread executed, the Discord API returned only 2 Discord servers. This made Claptrack believe that every single project in the database no longer belonged to any server, and it did what it’s supposed to do: it deleted the data. In a matter of 2 seconds, I watched in horror as the console filled with “Deleted from database” messages. And then, complete silence. The database was empty.

 

Use the backup krunk… Wrong backup!

You’d think with the amount of data hosted on our servers, we’d have our shit together and have a decent backup system in place.

Well, we did. The problem is that during the backend revamp of Claptrack, the database moved to a new location, and that new path was left out of the backup. An oversight that almost meant the end of the bot.

 

The only usable backup was from early May, just days before the new backend was launched. The problem with this backup is that it came from an SQLite database, with a somewhat different structure from the Postgres database we use now. This meant we had to manually import data from this backup into the database. Doing this, we were able to recover about 85% of the data that used to be in the database. Obviously the data in this backup was out of date, as the projects had been updated several times after the backup was made.

 

After the backup was restored, the bot was placed into Read-Only mode. This lets the bot update the database to the correct information without spamming the living hell out of the Discord channels it serves.

 

So what now, my server is missing data. Do you expect me to redo everything myself?

No, we don’t. This whole disaster could’ve been averted if we had configured our backup systems correctly.

 

So we have reached out to every server that uses Claptrack (where we could find invites to the Discord servers or have contact with the owners). We have already restored about 35 projects this way, and will continue to restore them as users contact us.

 

What are we doing to prevent this from happening again?

Well, what we should’ve done in the first place. Configure the damn backups correctly.

 

The Claptrack database container is now included in the backup and, at the time of writing, has already been backed up off-site. We have also set up a secondary, replicating database. This means that there are two usable copies of the database, aside from the main one.

 

These backups take place every 24 hours and are stored for 7 days. So if something like this should ever happen again, which it shouldn’t, the effect won’t be as severe.
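
To make the 7-day retention concrete, here is a rough sketch of how expired backups could be picked out for removal. This is not our actual backup tooling: the function name, the `backups` mapping of backup name to creation time, and the `RETENTION` constant are all made up for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Assumption for this sketch: backups are kept for 7 days, as described above.
RETENTION = timedelta(days=7)


def expired_backups(backups: Dict[str, datetime],
                    now: datetime,
                    retention: timedelta = RETENTION) -> List[str]:
    """Return the names of backups older than the retention window, oldest first.

    `backups` is a hypothetical mapping of backup name -> creation time.
    """
    cutoff = now - retention
    old = [(created, name) for name, created in backups.items() if created < cutoff]
    return [name for _, name in sorted(old)]
```

Anything returned by `expired_backups` is safe to prune; everything else stays available for a restore.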

 

Additionally, the cleanup thread has been patched to never run if Discord returns a list of fewer than 50 servers. The bot is currently used by 57 servers, so 50 is a safe threshold for now.
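
For the curious, the idea behind the patch looks roughly like the sketch below. This is a simplified illustration, not the bot's real code: `stale_guild_ids`, its parameters, and `MIN_EXPECTED_GUILDS` are hypothetical names, and the server IDs are assumed to be plain integers.

```python
from typing import Iterable, Set

# Threshold from this post: the bot is in 57 servers, so fewer than 50 is suspicious.
MIN_EXPECTED_GUILDS = 50


def stale_guild_ids(db_guild_ids: Iterable[int],
                    api_guild_ids: Iterable[int],
                    min_expected: int = MIN_EXPECTED_GUILDS) -> Set[int]:
    """Return server IDs whose data may be cleaned up.

    If the API reports fewer servers than `min_expected`, the response is
    treated as untrustworthy and nothing is marked for deletion.
    """
    api_ids = set(api_guild_ids)
    if len(api_ids) < min_expected:
        # Suspiciously short list: assume an API hiccup and delete nothing.
        return set()
    # Servers we have data for, but which the API no longer lists.
    return set(db_guild_ids) - api_ids
```

With a guard like this, the response from the incident (only 2 servers) yields an empty set, so no data is touched. A fixed threshold needs adjusting as the bot grows or shrinks, which is why we call 50 safe "for now".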

In the coming days, we’ll also be implementing a temporary file storage that keeps a list of all new projects and settings created in the 24 hours between backup runs. This will allow us to always have up-to-date copies of the data that should be in the database.
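
This is not built yet, so the following is only a sketch of one way such a temporary store could work: an append-only file with one JSON entry per change, which can be replayed on top of the last backup. The function names, the entry fields (`ts`, `kind`, `payload`) and the file format are all assumptions for illustration.

```python
import json
import time
from pathlib import Path
from typing import Any, Dict, Iterator


def record_change(journal: Path, kind: str, payload: Dict[str, Any]) -> None:
    """Append one change (e.g. a new project or setting) to the journal file."""
    entry = {"ts": time.time(), "kind": kind, "payload": payload}
    with journal.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def replay(journal: Path) -> Iterator[Dict[str, Any]]:
    """Yield journal entries in the order they were written."""
    if not journal.exists():
        return
    with journal.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                yield json.loads(line)
```

After each successful backup the journal could be emptied, so it only ever holds the changes made since the last backup.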

 

Closing words

This has by far been the worst day for us here at FDD, and the worst “disaster” to ever strike Claptrack. We’ve learned from our mistakes, and will continue to work with users to get their data fully restored. As of the time this document is published, the bot will be back to full operation.

 

We apologize for any inconvenience this event may have caused you, and we are working hard to improve both Claptrack and the infrastructure behind it.