Scaling a live video feed from 100 to 10k viewers

When building Quizzpy there were really two separate platforms we needed to build and scale: the API for the quiz itself (displayed in the UI as an overlay) and the video streaming platform (playing in the background). After all, there’s no show without a showman.

Hopeful beginnings

The requirements for the platform were pretty loose at the beginning: around three seconds of latency was acceptable (to preserve the "live" part of the show), and we targeted mobile phones - meaning anything doable on Android and iOS was fine. This meant we could use "raw" TCP and not care about compatibility with web browsers.

The Proof of Concept phase went through multiple ideas - using platforms like Twitch or YouTube, and popular SDKs - but all were either too slow (in terms of latency) or quite hard to implement on both mobile OSes.

Instead, the PoC ended up being a single nginx server running the RTMP module (https://docs.nginx.com/nginx/admin-guide/dynamic-modules/rtmp/). The camera operator then used OBS to publish the stream, and all users subscribed to it. Latency was great, deployment was simple, and there seemed to be no reason to worry.
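
If you want a feel for the moving pieces: the RTMP module exposes streams under rtmp://host/application/stream-key, and a quick way to sanity-check that something is actually being published is to probe that URL. The sketch below does exactly that with ffprobe - a stand-in for illustration, not part of the original setup, and the hostname and stream name are made up:

```python
import json
import subprocess

# Hypothetical stream URL - the nginx RTMP module exposes streams as
# rtmp://<host>/<application>/<stream key>; the names here are made up.
STREAM_URL = "rtmp://streaming.example.com/live/quiz"

def probe_stream(url: str) -> dict:
    """Ask ffprobe to describe the stream; raises if nothing is published."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_streams", "-of", "json", url],
        capture_output=True, text=True, check=True, timeout=30,
    )
    return json.loads(result.stdout)

if __name__ == "__main__":
    info = probe_stream(STREAM_URL)
    codecs = [s.get("codec_name") for s in info.get("streams", [])]
    print(f"Stream is up, codecs: {codecs}")
```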

Scaling beyond first server

Soon enough we grew beyond our first server. In truth, this was one optimisation we did too early, before we saw any warning signs in CPU utilisation. It was just very easy to implement, and we thought of it as "future-proofing" the platform.

What we did was simply add more downstream nginx servers with almost the same setup. The operator still published to the central server, all downstream servers subscribed to it, and users then subscribed to the downstream servers.

For the sake of simplicity, all downstream server IPs were added to a single DNS entry - a simple round-robin setup. We then monitored OS metrics on each server to ensure none was overloaded, and new ones were added as necessary. We even minimised costs by shutting down servers between quizzes - although the primary stayed up to allow testing.
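
For illustration, this is roughly how a client sees that shared DNS entry - one hostname, many A records. The hostname below is made up:

```python
import random
import socket

# Hypothetical DNS name holding one A record per downstream server.
STREAM_HOST = "stream.quizzpy.example.com"

def downstream_ips(host: str) -> list[str]:
    """Return all A records behind the shared DNS entry (port 1935 = RTMP)."""
    infos = socket.getaddrinfo(host, 1935, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

def pick_server(host: str) -> str:
    """Naive client-side choice - in practice the resolver's ordering
    (and record TTLs) does the round-robin for plain RTMP clients."""
    return random.choice(downstream_ips(host))

if __name__ == "__main__":
    print("Downstream servers:", downstream_ips(STREAM_HOST))
    print("This client would use:", pick_server(STREAM_HOST))
```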

How can you trust it if you didn't benchmark it?

Except that something weird happened: users were complaining about bad performance. Funnily enough, a lot of us at the software house were playing, and no one saw any issues.

The cause turned out to be embarrassingly small - when I created the dashboard, I scaled it in MBps, while the OS metrics were sent in Mbps - meaning we saturated the links eight times faster than expected. Well, that's what running a one-person Ops team might get you if no one else knows what to look at.
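
To make the factor of eight concrete, here's the back-of-the-envelope version - the 1 Gbps link and the per-viewer bitrate below are assumptions for illustration, not our actual numbers:

```python
# A bits-vs-bytes mix-up like the one above makes a link look eight times
# roomier than it really is (8 bits per byte).
LINK_CAPACITY_MBIT = 1_000      # assume a 1 Gbps uplink
PER_VIEWER_MBIT = 2             # assume a ~2 Mbps video stream per viewer

actual_viewer_capacity = LINK_CAPACITY_MBIT // PER_VIEWER_MBIT
apparent_viewer_capacity = actual_viewer_capacity * 8   # what the dashboard implied

print(f"Dashboard implied headroom for ~{apparent_viewer_capacity} viewers per server")
print(f"The link actually saturated at ~{actual_viewer_capacity} viewers")
```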

After adding more servers (well, four or five times as many) the issue went away. One important lesson - and one piece of tech we built to address it - was video benchmarking: a fleet of small Docker containers running headless VLC (dumping the stream into /dev/null) to simulate client traffic. We could even run it live during quizzes to see the effect on real users. This became a daily part of our test quizzes - and an important building block for the next problem.
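
Conceptually, each benchmarking container did little more than the sketch below - spawn a few headless VLC clients that pull the stream and throw it away. The exact flags we used are long gone, so treat these as an approximation, and the stream URL is made up:

```python
import subprocess
import sys

# How many simulated viewers this container should run.
CLIENTS = 10
# Hypothetical stream URL - the same one real viewers subscribed to.
STREAM_URL = "rtmp://stream.quizzpy.example.com/live/quiz"

def spawn_viewer(url: str) -> subprocess.Popen:
    """Start one headless VLC that pulls the stream and discards it."""
    return subprocess.Popen(
        [
            "cvlc",                 # console-only VLC
            "--intf", "dummy",      # no UI
            url,
            # Write the received stream straight into /dev/null.
            "--sout", "#standard{access=file,mux=ts,dst=/dev/null}",
        ],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )

if __name__ == "__main__":
    viewers = [spawn_viewer(STREAM_URL) for _ in range(CLIENTS)]
    print(f"Started {len(viewers)} simulated viewers", file=sys.stderr)
    for proc in viewers:
        proc.wait()   # keep the container alive as long as the streams play
```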

The unexpected

One piece of tech debt we never addressed was AWS accounts - the streaming servers were running on the same account as some Tooploox internal tooling. When doing the cost estimations we concluded that this would be a negligible part of the bill. Just in case, we tagged all the instances to be able to spot any unforeseen spikes.

Imagine my surprise when, for the second month in a row, the head of accounting at Tooploox complained about rising AWS costs… when nothing had changed there except Quizzpy! Time to investigate.

It turns out that beyond the cost of the EC2 instances themselves there’s a second component: bandwidth. Ingress was indeed minuscule (one TCP stream), but egress was monstrous - a binary video stream per client! If I remember correctly, four users amounted to about one MBps of traffic - meaning even 128 of them pushed roughly 20 GB out of AWS per ten-minute quiz, and we were aiming for ten thousand viewers.
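
Plugging the rough figures above into a quick script shows why accounting noticed - the per-GB price is an assumed ballpark for AWS egress at the time, not the exact rate we paid:

```python
# Back-of-the-envelope egress cost, based on the rough figures above.
MB_PER_SEC_PER_4_USERS = 1          # ~1 MBps of egress per four viewers
QUIZ_SECONDS = 10 * 60              # a ten-minute quiz
USD_PER_GB = 0.09                   # assumed ballpark AWS egress price

def egress_per_quiz_gb(viewers: int) -> float:
    mb_per_sec = viewers / 4 * MB_PER_SEC_PER_4_USERS
    return mb_per_sec * QUIZ_SECONDS / 1000  # MB -> GB (decimal)

for viewers in (128, 1_000, 10_000):
    gb = egress_per_quiz_gb(viewers)
    print(f"{viewers:>6} viewers: {gb:8.1f} GB per quiz, ~${gb * USD_PER_GB:,.0f}")
```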

Finding the right solution

There were really two problems here:

  • one short-term - with exponential user growth, next month's bill could become our last, and
  • the other long-term - we needed a solution that either scaled sub-linearly or where we could negotiate prices based on expected traffic.

Thankfully we already had the building blocks for quick iteration: test quizzes, and benchmarking tools to check the quality of the stream.

The short-term solution was in fact quite easy - find a cloud provider with cheaper bandwidth than AWS. I think we ended up running around a hundred small instances on Hetzner. With Terraform and Ansible, automating that was quick, so the most problematic part was convincing the provider to raise the limit of VMs per region for us.

After calls with video experts from AWS, and digging into how Apple and other providers solve this, it turned out we needed a CDN - and one that’d be fine with us using their bandwidth for video. But there’s no CDN for raw TCP subscriptions, so we needed to reverse the model and have clients pull the data instead. There is a video protocol that works exactly like that: HLS. Using it sadly required relaxing the business requirements - the best I could come up with was seven seconds of latency. It meant very short segments, with clients pulling the manifest of video chunks from our servers and downloading the chunks themselves from the CDN.
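
The resulting client-side flow was roughly the sketch below: keep polling the playlist on our origin, then fetch each new segment from the CDN. The hostnames are made up, and a real player obviously does far more (buffering, retries, decoding):

```python
import time
import urllib.parse
import urllib.request

# Hypothetical endpoints: the manifest lives on our servers, the segments on the CDN.
MANIFEST_URL = "https://stream.quizzpy.example.com/live/quiz.m3u8"
CDN_BASE = "https://cdn.example.com/live/"

def fetch(url: str) -> bytes:
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read()

def segment_urls(manifest: str) -> list[str]:
    """Pull segment URIs out of the playlist and point them at the CDN."""
    return [
        urllib.parse.urljoin(CDN_BASE, line.strip())
        for line in manifest.splitlines()
        if line.strip() and not line.startswith("#")
    ]

if __name__ == "__main__":
    seen: set[str] = set()
    while True:
        playlist = fetch(MANIFEST_URL).decode("utf-8")
        for url in segment_urls(playlist):
            if url not in seen:
                seen.add(url)
                chunk = fetch(url)      # a real player would feed this to a decoder
                print(f"got {len(chunk):>8} bytes from {url}")
        time.sleep(1)                   # short segments -> frequent manifest polls
```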

Finding each part of the solution required several test quizzes - some of them surfaced weird issues, like everything working fine on Android but a black screen on iOS (it turned out iOS requires TLS for all HTTP connections by default), or Android phones lagging far behind iOS (the manifest had to be served from our servers instead of the CDN).

How it worked in the end

If I remember correctly, we used https://www.cdn77.com in the end - and the cost of 10,000 users came close to what we had paid for 128 of them on AWS. We were even able to negotiate prices a little thanks to the predictable nature of our traffic.

Interestingly, we ended up not using their dedicated video streaming solution (due to latency issues), and instead just used their CDN for the video chunks, as described above.

It’d even have allowed us to finally build a Web client (since HLS is just HTTPS) if we found a player that could meet the latency requirements. So far I had only used it for quickly peeking at live quizzes via VLC locally.

Sadly, the cost optimisation was not enough to save the product from the lack of investors, and we had to shut it down at the end of 2019.


Tags: tooploox terraform infra quizzpy

