SquadOV's Recent Maintenance - Reducing Database Storage Size

Michael Bao
June 28, 2022 2:06 PM

Recently, SquadOV went down for an extended maintenance - the purpose of this maintenance was to reduce the size of the database after the recent WoW infrastructure upgrade to control our growing hosting costs. Prior to this maintenance, SquadOV spent around $10,000 every month just on our database - about $1,500 for the database instance itself and the rest on storage for data and backups. The primary culprit was storing all the data we collect from World of Warcraft combat logs which caused our database to balloon to nearly 60TB in size.

Many terabytes of data stored in the database.

The recent infrastructure upgrade moved storage of this data into S3 - resulting in less data stored (in terms of raw number of bytes), faster read/write times, and cheaper per byte storage. For reference, AWS RDS charges $0.115 for every gigabyte we store every month while S3, at its most expensive tier, charges $0.023 per gigabyte.

The amount of data stored in S3 to store all the combat log data.

Furthermore, after transferring to combat log data to S3 (using a combination of compressed Avro and JSON files), we see that the total amount of storage needed to hold the equivalent of the 60TB database is about 2TB. So not only are we saving 5x per-gigabyte, we're also using 30x less gigabytes. At the most expensive storage tier for S3, I'd expect at current levels for this to cost about $47.92 per month compared to the $222 we were previously spending per day.

Of course, those cost savings can only be realized if we shed the extra disk usage on our PostgreSQL database - and unfortunately, there's no easy way to do this. And given that there are some unclear limitations on using AWS DMS for transferring PostgreSQL databases (notably "For heterogeneous migrations, the JSON data type is converted to the Native CLOB data type internally."), I didn't want to risk the potential of data corruption. So the only other option is to do a dump and restore process.

This isn't something new - for reference, in July 2021, we performed this exact same dump and restore process to transfer our database from Google Cloud to AWS. That downtime lasted for about 4 hours driven primarily by the fact that we dropped, at that time, the combat log data we had stored. I used that as a reference for how long I'd expect this new downtime to last. Unfortunately, I underestimated how much SquadOV has grown since then and how much more data we've accumulated in the database even besides the WoW combat log data (the answer is about 3TB worth of data primarily stemming from Valorant and Hearthstone). As a result, the maintenance has lasted for nearly 24 hours at this point and is still on-going. Sorry about that!

This database maintenance was critical in order to control our rising cloud hosting costs. However, there's been a couple of key learnings from this that I hope to apply going forward:

  • There's a lot more data that we can shave off the database and store into S3 similar to World of Warcraft (Valorant, League of Legends, TFT, etc.).
  • The way we match up WoW matches is horrendously inefficient (this was known to a certain degree already, but this maintenance just highlighted this again).
  • We need better ways of communicating maintenance and other downtimes. Currently, the only three channels are 1) in the app leading up to the maintenance, 2) on our status page, and 3) in our Discord.

Thanks again for your patience as SquadOV continues to upgrade itself to better serve gamers around the world.