How much can a web-scale app handle?

The usual approach when building Web apps is to use "a framework" and say that "it scales horizontally". That’s even true, under certain assumptions. Turns out that when you build a quiz app, those assumptions are invalidated pretty quickly.
Adventure in the Web realm
When the three of us set out to build Quizzpy, none of us had experience with real-time applications, but plenty with Web backends. The prevailing wisdom at the time was to "just use a real-time language" - like Go (if you follow the latest trends), Java (if you come from enterprise) or C/C++ (if you want to cut yourself). Somehow the option that won was Elixir - "a dynamic, functional language for building scalable and maintainable applications", built on the Erlang VM. I joined the team after the Proof of Concept was already written - and with tight deadlines looming. So the natural choice was to recruit one more guy with no Elixir experience to join the original developer. I was eager to learn a new language, knowing little more than that Erlang had a good reputation and Prolog-inspired syntax (which I knew from my studies), and that Elixir aimed to be "a more scalable Ruby".
100 users: first problems
After writing the Minimum Viable Product, rolling it out for friends & family on a couple of quizzes and resetting the database, we thought it was ready for production. Imagine our surprise when, at close to 60 users, the operator of the quiz started complaining about slowdowns.
60 users? That’s… not a lot, right? But let’s put it the other way around: during a quiz, each user had to submit an answer within a 10-second window, every 30 seconds. So 60 users meant 120 requests per minute. In web speak, 120 RPM means your site is quite successful. Depending on your typical user path, that’d usually mean more than a thousand daily users. But for us, it meant just sixty concurrent ones. And we needed many more. So calling it a success and scaling instances up was not a long-term solution.
In the push for the MVP, we omitted one big thing: application-level metrics. We had Datadog, and we monitored VMs, but there was no information from within the app itself. That meant we were essentially flying blind (with an exception for critical errors, which were reported to Sentry). After adding a (simple) Datadog integration, it became quite clear what the problem was: the database. For each user interaction, we had (on average) two events written to the database. Our CQRS framework made a web-style assumption that each event should be committed in isolation, to ensure the best serialisation & consistency. That meant up to 120 writes to the database per second at the worst peaks. Our low-resource VM & database instance was simply not prepared for that kind of traffic. And even if we had scaled that linearly, it’s a known issue in Postgres - one should simply avoid large numbers of concurrent connections.

Luckily, there was a tradeoff to be made - we only needed consistency per quiz. After some discussions, we came to a conclusion that was agreeable from a product point of view: it’s not a banking app, and user experience is the priority here. If something cataclysmic interrupted the quiz (e.g. an instance was restarted), we could reset it. This opened up a whole field of optimisations that would be unacceptable in typical Web apps. And we were ready to exploit it - in this case, we simply buffered database writes until the end of the quiz, and then flushed everything. If this sounds scary to you - it is. There are race conditions and possible data losses - but BEAM (the Erlang VM) has idioms to deal with them, and enough resilience to avoid crashing on programmer errors.
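The buffering idea can be sketched as a GenServer per quiz that accumulates events in memory and only touches the database on flush. This is a minimal sketch, not the actual Quizzpy code: the `Quiz.EventBuffer` module name is hypothetical, and the commented-out `persist_all!/1` stands in for whatever your persistence layer provides (e.g. a single `Repo.insert_all/3`).

```elixir
defmodule Quiz.EventBuffer do
  @moduledoc """
  Buffers quiz events in memory and flushes them only when the quiz
  ends. Sketch only - names and persistence call are hypothetical.
  """
  use GenServer

  # Client API
  def start_link(quiz_id),
    do: GenServer.start_link(__MODULE__, quiz_id, name: via(quiz_id))

  # Writes are fire-and-forget casts: no database round-trip per event.
  def record(quiz_id, event), do: GenServer.cast(via(quiz_id), {:record, event})

  # Synchronous flush at quiz end - the only database interaction.
  def flush(quiz_id), do: GenServer.call(via(quiz_id), :flush, :timer.seconds(30))

  defp via(quiz_id), do: {:global, {__MODULE__, quiz_id}}

  # Server callbacks
  @impl true
  def init(quiz_id), do: {:ok, %{quiz_id: quiz_id, events: []}}

  @impl true
  def handle_cast({:record, event}, state),
    do: {:noreply, %{state | events: [event | state.events]}}

  @impl true
  def handle_call(:flush, _from, state) do
    events = Enum.reverse(state.events)
    # persist_all!(events)  # hypothetical: e.g. one Repo.insert_all/3 batch
    {:reply, {:ok, length(events)}, %{state | events: []}}
  end
end
```

Because casts from one sender to one receiver are delivered in order, a `flush/1` call issued after the last `record/2` is guaranteed to see every buffered event - but, as the post says, a crash before the flush loses everything, which is exactly the tradeoff accepted here.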
This became sort of a pattern for the rest of the application’s development: at a certain increment (usually 2-3x) we’d find a chokepoint, and focus on solving it long-term while increasing resources short-term to ensure user satisfaction. Then we’d have some time to implement new features, decrease tech debt - and prepare for the next performance push.
200 users: it works until you benchmark it
Well, the next one came much earlier than we expected - at a meagre 2x increment from the last: two hundred users. Turns out we had just patched the issue and declared victory without really benchmarking anything.
Thoroughly humbled, we wrote benchmarks. Due to our WebSocket implementation, there was no off-the-shelf benchmarking solution (like k6 for HTTP APIs). So the benchmark was just a simple Node.js client, with several hundred instances running on each VM. I think the tally was around a hundred clients per core. In any case, it was much less performant than the real API, but we could run it in parallel with real users to see the impact.
This was another product trick we came up with on the fly: test quizzes. In addition to "scheduled" quiz times (twice a day) we sometimes offered a third one (or even fourth or fifth), with a much smaller prize - but one nonetheless. Dedicated users joined those, offered feedback and allowed us to test out features in advance before enabling these on "real" quizzes.
500 users: Complexities of counting heads
Another surprising feature that puzzled us was Phoenix.Presence - a built-in feature that lets clients know how many other clients are connected. Seemed like an ideal option for displaying the count of users during the quiz, right? Well, that was our impression until we saw wild fluctuations in the counter - to the point where "current users" showed a couple of dozen, whilst the question summary showed that several hundred people had answered.
Turns out that during the famous 2 million connections benchmark this feature was disabled for performance reasons. So after some internal discussion, we decided to ditch it in favour of a much simpler solution: Redis. It was easy enough to keep a set of user IDs connected to the quiz, and display the count based on that in the app. And we never had an issue with it - until we removed Redis entirely, that is.
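The Redis-set approach boils down to three commands: `SADD` on join, `SREM` on leave, `SCARD` for the count. A sketch of how that might look with the Redix client (the module name and the `quiz:<id>:users` key layout are our own assumptions, not from the original code):

```elixir
defmodule Quiz.PresenceCount do
  @moduledoc """
  Tracks "who is in the quiz" as a Redis set. Sketch only: assumes a
  Redix connection registered as :redix, started elsewhere, e.g.
      {:ok, _} = Redix.start_link("redis://localhost:6379", name: :redix)
  """

  # Pure command builders - easy to test without a running Redis.
  def join_command(quiz_id, user_id), do: ["SADD", key(quiz_id), to_string(user_id)]
  def leave_command(quiz_id, user_id), do: ["SREM", key(quiz_id), to_string(user_id)]
  def count_command(quiz_id), do: ["SCARD", key(quiz_id)]

  # Runtime wrappers around the builders.
  def join(quiz_id, user_id), do: Redix.command(:redix, join_command(quiz_id, user_id))
  def leave(quiz_id, user_id), do: Redix.command(:redix, leave_command(quiz_id, user_id))

  def count(quiz_id) do
    {:ok, n} = Redix.command(:redix, count_command(quiz_id))
    n
  end

  defp key(quiz_id), do: "quiz:#{quiz_id}:users"
end
```

Because a Redis set is idempotent, a client that reconnects and re-joins doesn’t inflate the count - which is exactly the stability Phoenix.Presence failed to give them here.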
1000 users: know your language runtime
The next blocker stopped us in our tracks - and stumped the Elixir community too. We reached out for help, to no avail: our server was maxing out the CPU whilst doing nothing! As if some spinlock had been triggered. But the Erlang VM is highly parallelised - how could that happen?
Turns out not everyone using the shiny new language knows its much older runtime - it’s a known problem in the Erlang VM that a process whose mailbox grows too large can hog your CPU. https://www.erlang-in-anger.com is a quite fitting name for the book which gave us this critical clue. Protip: ETS is your friend when it comes to parallel reads, and GenServer.call handled with {:noreply, state} (replying later via GenServer.reply/2) works best for writes.
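Both halves of that protip can be shown in one small module - a sketch under my own naming, not the Quizzpy code. Reads hit a public ETS table directly, so they never enter the GenServer’s mailbox at all; writes go through `handle_call/3`, which replies with `GenServer.reply/2` and returns `{:noreply, state}` so the callback finishes immediately:

```elixir
defmodule Quiz.Scores do
  use GenServer

  @table :quiz_scores

  def start_link(_opts \\ []),
    do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  # Reads bypass the GenServer entirely: no mailbox, fully parallel.
  def get(user_id) do
    case :ets.lookup(@table, user_id) do
      [{^user_id, score}] -> score
      [] -> 0
    end
  end

  # Writes are serialised through the process, keeping ETS updates safe.
  def add(user_id, points), do: GenServer.call(__MODULE__, {:add, user_id, points})

  @impl true
  def init(:ok) do
    # read_concurrency: true optimises the table for many parallel readers.
    :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    {:ok, %{}}
  end

  @impl true
  def handle_call({:add, user_id, points}, from, state) do
    # Atomically increment; {user_id, 0} is inserted if the key is new.
    :ets.update_counter(@table, user_id, points, {user_id, 0})
    # Reply out-of-band, then return {:noreply, ...}: the pattern the
    # protip refers to for keeping call handling cheap.
    GenServer.reply(from, :ok)
    {:noreply, state}
  end
end
```

The crucial property is that a hundred thousand concurrent `get/1` calls cost the server process nothing, so its mailbox can never grow into the pathological state that triggered the CPU spin.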
2000 users: we need a bigger VM
At a certain point, you just have to concede that you need a bigger machine - all the cores are maxed out, memory is starting to become an issue, etc. Sadly, we were already on a maxed-out one from Heroku (on a hefty bill, no less). So what can you do?
We migrated to Google Cloud. Maybe not the most CPU-cost-optimal choice, but the Tooploox DevOps team assured us that GKE has the best UX and that Kubernetes would help us. They also offered help with moving to Helm.
This proved to be a pretty good choice: although we never needed to run more than one app (that is, our backend), several other k8s features were nice to have:
- turns out there’s a library (libcluster) that joins your Erlang nodes into a cluster using Kubernetes metadata,
- you can dynamically scale the GKE cluster,
- node affinity can be used to ensure some apps run only on certain nodes.
All in all, this allowed us to run several copies of our app, one of which was designated a leader. The leader had one big VM for itself and stored all quiz-related information in memory. It didn’t handle any client connections, only commands submitted by follower pods. Each follower handled client connections, managed metadata (like user count) and submitted commands to the leader. This star topology scaled up to forty thousand users (which we never reached beyond benchmarks).
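For the clustering bullet above, the wiring is mostly configuration. A sketch of a libcluster topology using its Kubernetes strategy - the `quizzpy` basename and `app=quizzpy` selector are placeholders for your own deployment labels, not values from the post:

```elixir
# config/runtime.exs (sketch)
import Config

config :libcluster,
  topologies: [
    gke: [
      strategy: Cluster.Strategy.Kubernetes,
      config: [
        # Resolve peer pods by IP, matching this label selector.
        mode: :ip,
        kubernetes_node_basename: "quizzpy",
        kubernetes_selector: "app=quizzpy"
      ]
    ]
  ]
```

With that in place, every follower pod sees the leader in `Node.list/0` as soon as it comes up, which is what makes the star topology below practical to operate.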
5000 users: RabbitMQ woes
What surprised us was RabbitMQ’s scaling. It’s also an Erlang app, and it scales horizontally, so why would it scale worse than our main API?
Turns out, there’s a cost to pay for stateful connections (like AMQP), and we could provide much better guarantees with our clustered VMs anyway. So we ripped out RabbitMQ (and later Redis) entirely, after discovering that there’s no distinction between calling a remote and a local process on Erlang - you just pass a PID (or a {name, node} tuple) - and BEAM will take care of everything! It’s like magic once you start using it. You have all the same primitives you are used to (synchronous or asynchronous communication), and also circuit breakers. Much better than almost any RPC I’ve used so far.
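The "no distinction" point can be seen in a few lines - a hypothetical sketch, not the real Quizzpy modules. A follower submits a command to the leader with an ordinary `GenServer.call/2`, addressing it by a `{name, node}` tuple; here `leader_node` defaults to `node()` so the example runs on a single, non-distributed node, whereas in production it would come from cluster metadata:

```elixir
defmodule Quiz.Leader do
  use GenServer

  def start_link(_opts \\ []),
    do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  def init(:ok), do: {:ok, %{handled: 0}}

  @impl true
  def handle_call({:command, cmd}, _from, state) do
    # Real leader would apply the command to in-memory quiz state here.
    {:reply, {:ok, cmd}, %{state | handled: state.handled + 1}}
  end
end

defmodule Quiz.Follower do
  # The {name, node} tuple works identically whether the leader runs in
  # this VM or on another machine in the BEAM cluster.
  def submit(command, leader_node \\ node()) do
    GenServer.call({Quiz.Leader, leader_node}, {:command, command})
  end
end
```

The same call carries a timeout (5 seconds by default) and crashes the caller if the leader is down - which is the failure-isolation behaviour the post compares favourably to typical RPC stacks.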
10000 users: how much data can we pipe through Nginx?
At this point we were pretty sure we could take on the world: we had one leader (with CPU utilisation under 20%) and a potentially unlimited number of followers. One thing that grabbed our attention was Nginx’s CPU usage - our load balancers were running pretty hot under the sheer volume of WebSocket traffic. So our next step was to investigate the API we had kept unchanged throughout the whole process, to cut it down. It was quite monstrous, with the server sending the whole client state on each message - but it worked best for clients, as they didn’t need to retain any information and just re-rendered everything on each update.
Sadly, the next step never came to be - we weren’t able to secure further funding, and we had to shut the project down at the end of 2019.