As some of you might know, SquadOV went down for a bit on April 14th. Shortly after telling everyone about our cool new logo in Discord (which you should definitely join), there were rumblings about "Is SquadOV down?"
Uh oh! A quick look at our logs told us that all our servers were suddenly unable to connect to our database. Looking at the queries being run, it became clear that the database query we were using to feed our recent VODs and clip library pages was causing extreme performance issues - leading the database to become unresponsive.
Yikes! A quick hotfix was put in place to disable this query for our recent VODs page, which brought things back to normal (and introduced a good deal of confusion, as users were now directed to the "Game Logs" page instead). We notified users via our status page (which we integrate into the app) that this would be a fairly lengthy upgrade. But at least the time from identification to the servers being back up was only about an hour.
So at this point, it became clear that some of our technical debt was coming back to haunt us. That database query had already been the cause of a lot of slowness in the app, and it was about time it got an upgrade. But how? That query needs to search through all 10,000,000 VODs and clips as well as their corresponding match data for all the games we support. In hindsight, we probably could have alleviated the issue with a couple of read replicas. But a change that was already planned was to move all this search/filter functionality to ElasticSearch - this is literally what it was built for - so I went ahead with the idea. The plan was simple: migrate every existing VOD and clip into ElasticSearch, then point the recent VODs and clip library pages at it instead of the database.
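For a sense of what that looks like, here's a minimal sketch of a "recent VODs" search expressed as an ElasticSearch query body. The index layout, the field names (`owner_id`, `game`, `recorded_at`), and the helper function itself are hypothetical assumptions for illustration, not SquadOV's actual schema:

```python
# Hypothetical sketch of a "recent VODs" search for ElasticSearch.
# Field names and structure are illustrative assumptions, not
# SquadOV's actual index mapping.

def recent_vods_query(user_id, game=None, page=0, page_size=20):
    """Build an ElasticSearch query body that filters VODs by owner
    (and optionally by game), sorted newest first, with pagination."""
    filters = [{"term": {"owner_id": user_id}}]
    if game is not None:
        filters.append({"term": {"game": game}})
    return {
        "query": {"bool": {"filter": filters}},
        "sort": [{"recorded_at": {"order": "desc"}}],
        "from": page * page_size,
        "size": page_size,
    }

body = recent_vods_query(user_id=42, game="wow", page=1)
print(body["from"], body["size"])  # 20 20
```

Because everything runs in filter context against pre-indexed documents, ElasticSearch can answer this without scanning millions of rows and joining per-game match tables the way the old SQL query had to.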
Development work was completed on April 20th, and we began to roll out the changes. Everything was going great until the database got knocked out again. This time it was just a dumb migration trying to create an index that already existed on an extremely large table (oops). Fine, stopped that and continued on. I did a quick test where I migrated my own VODs and clips, and things seemed to be working fine (and fast!). Time to start the full migration, only to be met with a good ol' database outage again.
The fix, admittedly, was simple: restart the database and pause the migration until I could figure out what went wrong. The first reaction was to just reduce the number of workers doing the ElasticSearch migration, and that seemed to stabilize things. Phew! Time to leave the migration running overnight and see where we are in the morning.
Nine hours later, I woke up to two things:
Turns out that the previous day's database outages took out our WoW combat log processing as well. Oops! Let's just go ahead and restart that service...and...database outage. Restart the database...and another database outage. Yikes.
Turns out, the way I upgraded the WoW processing workers to support the new ElasticSearch system was dumb: it was hammering the database and causing it to buckle under load. That was a fairly easy fix, and WoW processing finally got back on track. By that time we were about 700,000,000 combat log messages behind, and clearing the backlog took about 10 hours. Yikes.
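The post doesn't spell out the fix, so purely as a hedged illustration: a common cure for this failure mode is batching, i.e. flushing database writes in groups rather than making one round trip per combat log message. The function and the 1,000-message batch size below are assumptions for the sketch, not SquadOV's actual code:

```python
# Illustrative sketch of batching writes instead of one DB round trip
# per combat log message. BATCH_SIZE and the flush callback are
# assumptions, not SquadOV's real implementation.

BATCH_SIZE = 1000

def process_combat_log(messages, flush):
    """Accumulate parsed messages and flush them to the database in
    batches, instead of hammering the DB once per message."""
    batch = []
    flushed = 0
    for msg in messages:
        batch.append(msg)
        if len(batch) >= BATCH_SIZE:
            flush(batch)        # one round trip covers 1,000 messages
            flushed += len(batch)
            batch = []
    if batch:                   # flush whatever remains
        flush(batch)
        flushed += len(batch)
    return flushed

calls = []
done = process_combat_log(range(2500), flush=lambda b: calls.append(len(b)))
print(done, calls)  # 2500 [1000, 1000, 500]
```

With 2,500 messages this issues 3 database round trips instead of 2,500 - the difference between a healthy database and one that buckles under load.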
At this point, the migration was running at about 200,000 videos per hour, with an ETA of a couple of days. Better, but not quite there.
Waking up this morning, there were still about 200,000 tasks left on the RabbitMQ queue (which translates to about 2,000,000 videos), and we were processing about 10 messages a second. Getting there, but still a tad slow. At least all the VODs that needed to be migrated were already in tasks on RabbitMQ, so the producer worker could be shut down. And then the first bit of amazingness happened - the queue nearly instantly cleared out.
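As a quick sanity check, the numbers in the paragraph above work out like this:

```python
# Back-of-envelope math using the figures quoted above.
tasks_remaining = 200_000
videos_remaining = 2_000_000

videos_per_task = videos_remaining / tasks_remaining   # each task batches ~10 videos
rate_tasks_per_sec = 10                                # observed processing rate
eta_hours = tasks_remaining / rate_tasks_per_sec / 3600

print(videos_per_task)      # 10.0
print(round(eta_hours, 1))  # 5.6
```

So at the observed rate the queue should have needed roughly another five and a half hours, which made the near-instant clear all the more surprising.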
Turns out the producer worker was causing contention on the database table that all the worker processes needed to update. Oops. But hey! All VODs migrated!
So, the big question is, "did it work?" And the answer is a resounding yes. All our recent VOD queries are significantly faster now. The average response time is about 2.6x faster and the median response time is about 4.7x faster. This difference is noticeable in the app, especially for users in one of our larger public squads. This change took quite a bit of work and effort, but hopefully it's all worth it.
Thanks to all our users for staying patient with us, and we're looking forward to what's next!