How can we squeeze more out of our data centre?

Compared to previous infrastructure projects I had the opportunity to work on at Tooploox (Insights and Quizzpy), the setup at Opera was refreshingly simple and battle-tested:
- several data centres, operated by infrastructure teams,
- servers with static IP addresses - either physical or virtual - given to each team, with SSH access based on a central public key registry,
- server orchestration based on Puppet - in this case my team operated an isolated puppet master just for our infrastructure,
- central load balancing done by Linux Virtual Server (http://www.linuxvirtualserver.org/software/index.html), handing off connections to separate machines,
- each machine running a standard nginx + gunicorn or nginx + uwsgi setup for handling connections (a minimal sketch of this layer follows the list).
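For illustration, the per-machine application-server layer could look roughly like the gunicorn configuration file below (which is plain Python). The values and names are assumptions for this sketch, not the actual Opera configuration:

```python
# gunicorn.conf.py -- hypothetical per-machine app server setup behind nginx
import multiprocessing

bind = "127.0.0.1:8000"                        # nginx proxies requests to this local port
workers = multiprocessing.cpu_count() * 2 + 1  # a common starting heuristic for sync workers
worker_class = "sync"                          # the service could equally run under uwsgi instead
timeout = 30                                   # seconds before a stuck worker is recycled
accesslog = "-"                                # access log to stdout
```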
What was needed
At the height of the pandemic, everyone started using the Internet much more. This in turn meant that existing scaling plans (multi-year roadmaps for replacing older machines with newer, more powerful ones) suddenly had to be rushed, which caused a hardware shortage.
In the middle of this, my team was left with the existing (pretty old) hardware, suddenly scrambling for more computing power. The backend was already pretty well-optimised, so another route had to be found. I was the one turning over every stone in search of a solution.
How I approached it
In programming, everything is a trade-off, so I needed to find one trading CPU for anything else really - disk, memory, bandwidth. I started by locally benchmarking the only piece of the stack that was consuming substantial processing power: our service written in Python.
The first step was getting metrics: we already had statsd, but for reasons lost to history we didn't have full coverage throughout Opera's private libraries. Amending that came first.
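Filling those gaps mostly meant wrapping library entry points in timers and counters. A minimal sketch of that kind of instrumentation, assuming the statsd Python package (client setup, metric names, and the wrapped function are illustrative, not the real library):

```python
# Hypothetical statsd instrumentation added around an internal library call.
import json
from statsd import StatsClient

metrics = StatsClient(host="localhost", port=8125, prefix="myservice")

@metrics.timer("serialization.encode")        # records how long each call takes
def encode_payload(payload: dict) -> bytes:
    metrics.incr("serialization.encode.calls")
    # json is only a runnable stand-in for the private-library serialization logic.
    return json.dumps(payload).encode("utf-8")
```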
The next step was benchmarking: the service was using a custom serialization format to save bandwidth, so I needed a way to replay some of the production traffic against a test server without compromising data integrity or accidentally skewing the results.
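A replay harness for that can be quite small. The sketch below assumes captured request bodies stored one per line in a gzipped file and a hypothetical test endpoint; both are illustrative, not the actual setup:

```python
# Hypothetical replay of captured production request bodies against a test instance.
import gzip
import requests

TEST_ENDPOINT = "http://test-host:8000/ingest"   # illustrative URL

def replay(capture_path: str) -> None:
    with gzip.open(capture_path, "rb") as capture:
        for line in capture:
            # Each line is one raw body in the custom serialization format;
            # it is sent unchanged so the benchmark exercises the real decode path.
            response = requests.post(
                TEST_ENDPOINT,
                data=line.rstrip(b"\n"),
                headers={"Content-Type": "application/octet-stream"},
                timeout=5,
            )
            response.raise_for_status()

if __name__ == "__main__":
    replay("captured_requests.bin.gz")
```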
Having benchmarks, we had a pretty good picture of which operations took how much CPU and time - but couldn't see any obvious place for changes. We needed to dig deeper, so I added tracing with OpenTelemetry and Jaeger (https://www.jaegertracing.io/).
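Wiring that in is mostly boilerplate. A minimal sketch, assuming the opentelemetry-sdk and OTLP exporter packages and a Jaeger collector that accepts OTLP (service and span names are illustrative):

```python
# Hypothetical OpenTelemetry setup exporting spans to Jaeger over OTLP.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "myservice"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="jaeger:4317", insecure=True))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

def handle_request(payload: bytes) -> None:
    # Wrap the suspicious stages in spans so rare, slow outliers show up in Jaeger.
    with tracer.start_as_current_span("decode"):
        ...
    with tracer.start_as_current_span("enqueue"):
        ...
```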
Traces started showing something interesting - pretty wild slowdowns in the processing, invisible even in 99th-percentile metrics. Less than 1% of traces (0.1% maybe?) took around 1000x longer than typical ones.
What the solution was
Now having an idea of what to look for, I could find the culprit in the code: batching items before sending them to our queue system. It was both complex (predating certain features in the queue SDK) and inefficient (in terms of CPU).
My proposed solution was to rewrite the library entirely, allowing switching between CPU-optimal and bandwidth-optimal implementations, whilst ditching features already implemented upstream.
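The core idea can be sketched as a batcher with a pluggable encoder, so each deployment can pick its side of the CPU-vs-bandwidth trade-off. Everything below (formats, names, thresholds) is illustrative rather than the real library:

```python
# Hypothetical batching layer with a pluggable serialization strategy.
import gzip
import json
from typing import Callable, Iterable, List

def encode_fast(items: Iterable[dict]) -> bytes:
    # CPU-optimal stand-in: newline-delimited JSON, no compression.
    return b"\n".join(json.dumps(item).encode("utf-8") for item in items)

def encode_compact(items: Iterable[dict]) -> bytes:
    # Bandwidth-optimal stand-in: the same payload, gzip-compressed (costs extra CPU).
    return gzip.compress(encode_fast(items))

class Batcher:
    def __init__(self, send: Callable[[bytes], None],
                 encode: Callable[[Iterable[dict]], bytes] = encode_fast,
                 max_items: int = 500):
        self._send = send
        self._encode = encode
        self._max_items = max_items
        self._buffer: List[dict] = []

    def add(self, item: dict) -> None:
        self._buffer.append(item)
        if len(self._buffer) >= self._max_items:
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._send(self._encode(self._buffer))
            self._buffer.clear()
```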
This was just the first step though: afterwards we needed to test it, add support for the new data format in the consumers, and ensure data integrity was preserved. Having a pretty good setup for canary deployments, the solution was to "fork" the data: one copy was written in the old format and another in the new one. Once we had a consumer for the new format, we could verify that both copies contained the same data. Afterwards, we could switch a certain percentage of requests to the new format, to ensure we had spare CPU to process spikes.
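That rollout can be summarised in a few lines: dual-write while verifying, then shift a growing share of traffic to the new format. The encoders, queue interface, and percentages below are assumptions for illustration, not the production code:

```python
# Hypothetical dual-write ("fork") and percentage-based rollout of the new format.
import json
import random
from typing import Callable, Iterable, List

def encode_old(items: Iterable[dict]) -> bytes:
    # Stand-in for the old, bandwidth-optimised format.
    return json.dumps(list(items)).encode("utf-8")

def encode_new(items: Iterable[dict]) -> bytes:
    # Stand-in for the new, CPU-friendlier format.
    return b"\n".join(json.dumps(item).encode("utf-8") for item in items)

def publish(items: List[dict], send: Callable[[bytes, str], None],
            dual_write: bool, new_format_percent: float) -> None:
    if dual_write:
        # Verification phase: write both copies so a consumer can check they match.
        send(encode_old(items), "old")
        send(encode_new(items), "new")
    elif random.uniform(0.0, 100.0) < new_format_percent:
        # Rollout phase: a growing share of requests uses only the new format.
        send(encode_new(items), "new")
    else:
        send(encode_old(items), "old")
```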
This new format saved roughly 30% of CPU time at the cost of roughly 30% more bandwidth, whilst the new implementation kept the performance characteristics of the old one (but was simpler in terms of code). The format stayed in use until we finally got new hardware and could turn it off, to save on ingress costs in the next subsystem.