Engineering

The Process of Scaling SquadOV's Search to 10,000,000 Videos

Michael Bao
April 22, 2022 4:16 PM

The Crash

As some of you might know, SquadOV went down for a bit on April 14th. Shortly after telling everyone about our cool new logo in Discord (which you should definitely join), there were rumblings about "Is SquadOV down?"

The beginning of the end.

Uh oh! A quick look at our logs told us that all our servers were suddenly unable to connect to our database. Looking at the queries that were being run, it became very clear that the query we were using to feed our recent VODs and clip library pages was causing extreme performance issues - leading the database to become unresponsive.

3,000 seconds to respond, uh oh.

Yikes! A quick hotfix was put in place to disable this query for our recent VODs page, which brought things back to normal (and introduced a good deal of confusion, since users were now directed to the "Game Logs" page instead). We notified everyone via our status page (which is integrated into the app) that this would be a fairly lengthy upgrade. But at least the time from identifying the problem to having the servers back up was only about an hour.

One hour response time from identification to fix, not terrible.

The Plan

So at this point, it became clear that some of our technical debt was coming back to haunt us. That database query had already been the cause of a lot of slowness in the app, and it was about time it got an upgrade. But how? The query needs to search through all 10,000,000 VODs and clips, as well as their corresponding match data, for all the games we support. In hindsight, we probably could have alleviated the issue with a couple of read replicas. But a change that was already planned was to move all this search/filter functionality to ElasticSearch, since searching and filtering is literally what it was built for, so I went ahead with that idea. The plan was simple:

  1. Set up ElasticSearch locally as a Docker container.
  2. Create RabbitMQ tasks to synchronize new matches and VODs to ElasticSearch (a rough sketch follows this list).
  3. Create RabbitMQ tasks to update subparts of the VOD document on smaller changes (sharing, favorite lists, etc.).
  4. Add hooks in the API and other services to trigger the correct task when necessary.
  5. Rewrite our Recent VODs and Clip Library API calls to use ElasticSearch.
  6. Set up ElasticSearch Cloud for the production deployment.
  7. Deploy all of the above changes.
  8. Start migrating all the previously stored VODs and clips to ElasticSearch.
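
To give a concrete feel for steps 2 through 5, here is a minimal sketch of what one of those sync tasks could look like, written in Python with pika (RabbitMQ) and the official elasticsearch client. The queue name, index name, and document fields are all hypothetical and are not SquadOV's actual schema; the point is just the consume-then-index flow.

```python
# Hypothetical VOD sync worker: consume a RabbitMQ task, build a search
# document, and index it into ElasticSearch. All names are illustrative.
import json

import pika
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")


def handle_vod_sync(channel, method, properties, body):
    task = json.loads(body)  # e.g. {"vod_uuid": "...", "match_uuid": "...", ...}

    # A real worker would load the VOD and its match data from the primary
    # database here and denormalize everything into a single document.
    document = {
        "vod_uuid": task["vod_uuid"],
        "match_uuid": task.get("match_uuid"),
        "game": task.get("game"),
        "owner_user_id": task.get("owner_user_id"),
        "start_time": task.get("start_time"),
        "is_clip": task.get("is_clip", False),
    }

    # Index by VOD UUID so re-running the same task is idempotent.
    es.index(index="vods", id=task["vod_uuid"], document=document)
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="vod_es_sync", durable=True)
channel.basic_consume(queue="vod_es_sync", on_message_callback=handle_vod_sync)
channel.start_consuming()
```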

April 20th

Development work was completed on April 20th, and we began to roll out the changes. Everything was going great until the database got knocked out. This time it was just a dumb migration trying to create an index that already existed on an extremely large table (oops). Fine, stopped that and continued on. I did a quick test where I migrated my own VODs and clips, and things seemed to be working fine (and fast!). Time to start the full migration, only to be met with a good ol' database outage again.
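
As an aside, if the database in question is PostgreSQL (an assumption on my part), the usual way to avoid both problems here, locking a huge table for the duration of the build and re-creating an index that is already there, is CREATE INDEX CONCURRENTLY IF NOT EXISTS. A sketch, with made-up index and table names:

```python
# Hypothetical sketch, assuming PostgreSQL + psycopg2: build the index
# without blocking writes. CONCURRENTLY cannot run inside a transaction,
# hence autocommit. Index, table, and column names are made up.
import psycopg2

conn = psycopg2.connect("dbname=squadov")  # connection string is illustrative
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute(
        "CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_vods_end_time "
        "ON vods (end_time)"
    )
conn.close()
```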

Two database restarts in one day.

The fix, admittedly, was simple: restart the database and pause the migration until I could figure out what went wrong. The first reaction was to just reduce the number of workers doing the ElasticSearch migration, and that seemed to stabilize things. Phew! Time to leave the migration running overnight and see where we'd be in the morning.
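
Reducing the number of workers simply meant running fewer consumer processes, but for the curious, a pika-based consumer like the hypothetical sketch after the plan above also has a per-consumer knob: prefetch. Something along these lines keeps any one worker from piling load onto the database:

```python
# Hypothetical throttling sketch, continuing the earlier worker: cap how many
# unacknowledged tasks each consumer holds so a burst on the queue doesn't
# become a burst of load on the database.
channel.basic_qos(prefetch_count=1)  # hand this worker one task at a time
channel.basic_consume(queue="vod_es_sync", on_message_callback=handle_vod_sync)
channel.start_consuming()
```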

April 21st

9 hours later, I woke up to two things:

  • World of Warcraft games were reportedly not recording.
  • The rate of transfer was slow - so slow that the ETA for a full transfer was 8 days. Too slow.

Turns out that the previous day's database outages took out our WoW combat log processing as well. Oops! Let's just go ahead and restart that service...and...database outage. Restart the database...and another database outage. Yikes.

It's like double rainbow, but database outages.

Turns out, the way I had upgraded the WoW processing workers to support the new ElasticSearch system was dumb: they were hammering the database and causing it to buckle under load. It was a fairly easy fix, and WoW processing finally got back on track. By that point we were about 700,000,000 combat log messages behind, and that backlog took about 10 hours to clear out. Yikes.
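
I won't claim this is literally the fix that went in, but the shape of the problem is a classic one: one database round trip per combat log message. The usual cure is batching, roughly like this (table and column names are invented):

```python
# Hypothetical batching sketch: instead of one database round trip per
# combat log message, accumulate messages and flush them in bulk.
import psycopg2
from psycopg2.extras import execute_values

BATCH_SIZE = 1000

conn = psycopg2.connect("dbname=squadov")  # connection string is illustrative
buffer = []


def handle_combat_log_message(message):
    buffer.append((message["match_uuid"], message["timestamp"], message["payload"]))
    if len(buffer) >= BATCH_SIZE:
        flush()


def flush():
    if not buffer:
        return
    with conn, conn.cursor() as cur:
        # One INSERT for the whole batch instead of thousands of tiny ones.
        execute_values(
            cur,
            "INSERT INTO wow_combat_log_events (match_uuid, tm, payload) VALUES %s",
            buffer,
        )
    buffer.clear()
```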

At this point, the migration was running at about 200,000 videos per hour, which put the ETA at a couple of days (at that rate, the full 10,000,000 videos work out to roughly 50 hours). Better, but not quite there.

April 22nd

Waking up this morning, I found there were still about 200,000 tasks left on the RabbitMQ queue (which translates to about 2,000,000 videos), and we were processing about 10 messages a second. Getting there, but still a tad slow. At least all the VODs that needed to be migrated were already sitting in tasks on RabbitMQ, so it was time to shut down the producer worker. And then the first bit of amazingness happened - the queue cleared out almost instantly.

200k to 0 in a matter of seconds.

Turns out the producer worker was causing contention on the database table that all the worker processes needed to update. Oops. But hey! All VODs migrated!
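
Shutting the producer down was the pragmatic fix here, since every VOD was already enqueued. The more general pattern for letting producers and workers share a status table without blocking each other is to claim rows with SKIP LOCKED and keep transactions short; a hypothetical sketch, assuming PostgreSQL and an invented vod_es_migration table:

```python
# Hypothetical sketch of the generic cure for this kind of contention,
# assuming PostgreSQL: claim a small batch of pending rows with SKIP LOCKED
# so concurrent processes touching the same table never block each other.
import psycopg2

conn = psycopg2.connect("dbname=squadov")  # connection string is illustrative


def claim_batch(batch_size=100):
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            UPDATE vod_es_migration
               SET state = 'in_progress'
             WHERE vod_uuid IN (
                   SELECT vod_uuid
                     FROM vod_es_migration
                    WHERE state = 'pending'
                    ORDER BY vod_uuid
                    LIMIT %s
                      FOR UPDATE SKIP LOCKED
             )
         RETURNING vod_uuid
            """,
            (batch_size,),
        )
        return [row[0] for row in cur.fetchall()]
```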

So much speed

Faster searches, happier users.

So, the big question is, "Did it work?" And the answer is a resounding yes. All our recent VOD queries are significantly faster now: the average response time is about 2.6x faster and the median response time is about 4.7x faster. The difference is noticeable in the app, especially for users in one of our larger public squads. This change took quite a bit of effort, but hopefully it's all worth it.
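
To give a sense of what the new read path looks like conceptually (again, the index and field names are invented; this is not SquadOV's actual query), the recent VODs page now boils down to a filtered, sorted ElasticSearch search instead of a giant relational join:

```python
# Hypothetical sketch of an ElasticSearch-backed "recent VODs" query:
# filter by who can see the VOD, exclude clips, newest first.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

response = es.search(
    index="vods",  # index name is made up
    query={
        "bool": {
            "filter": [
                {"terms": {"owner_user_id": [123, 456]}},  # e.g. the squad's members
                {"term": {"is_clip": False}},
            ]
        }
    },
    sort=[{"start_time": {"order": "desc"}}],
    size=20,
)

for hit in response["hits"]["hits"]:
    print(hit["_source"]["vod_uuid"], hit["_source"]["start_time"])
```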

Thanks to all our users for staying patient with us, and we're looking forward to what's next!