Unclogging the update drain

The other interesting part of working at Opera in the Statistics Platform team was providing tools for analytics teams - several of them. In practice, this meant that quite a small team had full responsibility and accountability for what data was ingested and how it was processed.
Due to the size of the data lake and the processing power needed to aggregate results, we needed a separate server dedicated to hosting a JupyterLab environment. One interesting plot twist was the need for recurring jobs: certain aggregations had to run periodically to refresh dashboards, serve reports or prepare data for product teams. This meant the aforementioned beefy machine also had to host these jobs - as nothing else in our data centre could.
Essentially everything ran within Docker containers - one persistent container per user, https://github.com/jupyterhub/jupyterhub for authentication and routing requests, and a good deal of hand-written scripts to ensure containers were running and periodic jobs were executed within them.
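For context, a hub-plus-DockerSpawner setup of this kind typically boils down to a configuration roughly like the sketch below. The image name, network and volume paths here are illustrative assumptions, not the actual configuration we ran.

```python
# jupyterhub_config.py - minimal sketch of a JupyterHub + DockerSpawner setup.
# Image name, network and volume paths are illustrative assumptions.
c = get_config()  # noqa: F821 - provided by JupyterHub when loading the config

# Spawn one persistent container per user.
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "internal-registry/analytics-notebook:latest"  # hypothetical image
c.DockerSpawner.remove = False          # keep containers around between sessions
c.DockerSpawner.network_name = "jupyterhub"

# Persist each user's work directory on the host.
c.DockerSpawner.volumes = {"/srv/jupyter/{username}": "/home/jovyan/work"}

# Make the hub reachable from the spawned containers.
c.JupyterHub.hub_ip = "0.0.0.0"
```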
When I was assigned the task of rearchitecting this piece of the platform, it was already suffering from a couple of major problems:
- stale packages - the Docker image had been built a couple of years before, and conflicts among transitive dependencies precluded ever updating it,
- poor resource isolation - any user could essentially kill the machine by hogging all of its memory,
- brittleness of scripts - since a lot of orchestration was hand-written and tweaked manually, rebuilding the machine from the ground up required a good deal of manual intervention (unlike anything else in our stack, where Puppet could re-create a whole host in a matter of minutes),
Naturally, the goal was to address these shortcomings - whilst ideally not imposing a lot of extra work on other teams.
Decisions, decisions
The first step was to compile a list of decisions to be made, pick the essential ones and build a proof of concept to validate those crucial choices. Then coordinate with other teams, make more choices and create a minimum viable version of the new system. Afterwards, people could start migrating, and extra features could be added in parallel - depending on what was needed and what feedback came in.
In short, the new system kept the essentials of the old one (JupyterHub, Docker), while removing most of the custom Python code around periodic jobs in favour of:
- Kubernetes (specifically https://k3s.io ) for execution,
- JSON files (stored in git) for parameters,
- https://jupytext.readthedocs.io for scripts themselves,
There was still a decent amount of custom code, but overall much less - and it was more isolated and easier to replace independently in the future.
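The jupytext choice deserves a quick illustration: notebooks live in git as plain Markdown, and jupytext converts between that and `.ipynb` on demand. A minimal sketch using its Python API is below; the file names are made up.

```python
# Minimal sketch of round-tripping a notebook through jupytext,
# so the version-controlled artefact is plain Markdown rather than .ipynb.
# File names are illustrative.
import jupytext

# Read a Markdown-formatted notebook (as committed to git)...
notebook = jupytext.read("daily_aggregation.md")

# ...and materialise it as a regular .ipynb for interactive editing,
# or write it back to Markdown before committing.
jupytext.write(notebook, "daily_aggregation.ipynb")
jupytext.write(notebook, "daily_aggregation.md", fmt="md")
```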
Overall, the system's inner workings consisted of:
- Docker image pipeline - once a week, a new image is built with the latest versions of all packages required by users. There are a couple of flavours to avoid one big spaghetti of dependencies,
- validation - the new image is picked up, tested and set as the default for that week,
- updates - users running old images are notified - and containers lagging more than a week behind are forcibly replaced,
- persistence - notebook runs are saved as markdown files to disk and users commit them to git,
- job definitions - to add or change periodic jobs, a user commits a definition (as a JSON file),
- scheduling - repositories with periodic jobs are scanned and each definition creates a new workflow in Airflow (see the sketch after this list),
- execution - each workflow is then executed via Kubernetes, on the same host with access to user-local configuration,
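To make the scheduling and execution steps more concrete, here is a rough sketch of how committed JSON definitions can be turned into Airflow DAGs that run each notebook as a Kubernetes pod. It assumes Airflow 2.x with the cncf-kubernetes provider; the field names, paths, namespace and image are illustrative assumptions, not the actual definition format we used.

```python
# Rough sketch: turn committed JSON job definitions into Airflow DAGs.
# Field names ("name", "schedule", "notebook", "image") and paths are assumed.
import json
from pathlib import Path

import pendulum
from airflow import DAG
# Import path depends on the cncf-kubernetes provider version; this is the recent one.
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

DEFINITIONS_DIR = Path("/srv/periodic-jobs")  # checkout of the jobs repository

for definition_file in DEFINITIONS_DIR.glob("**/*.json"):
    spec = json.loads(definition_file.read_text())

    dag = DAG(
        dag_id=spec["name"],
        schedule_interval=spec["schedule"],  # e.g. "0 6 * * *"
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        catchup=False,
    )

    KubernetesPodOperator(
        task_id="run-notebook",
        name=spec["name"],
        namespace="analytics",
        image=spec.get("image", "internal-registry/analytics-notebook:weekly"),
        # Execute the jupytext Markdown notebook inside the pod.
        cmds=["jupytext", "--to", "notebook", "--execute", spec["notebook"]],
        dag=dag,
    )

    # Airflow discovers DAGs by scanning module globals.
    globals()[spec["name"]] = dag
```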
The brittleness was also addressed by ensuring everything was done via Puppet - and making sure you could always create a local VM to test out the whole process.
Resource isolation was fixed using systemd primitives - each container was run under a separate cgroup parent, with limits set according to whether it was a periodic or interactive job.
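The mechanism itself is simple; a minimal sketch using the Docker SDK for Python is below. The slice names and limits are made-up examples - in practice the limits were applied through the spawner and orchestration layer rather than ad-hoc calls like this.

```python
# Minimal sketch of per-container resource isolation via systemd slices.
# Slice names and limits are illustrative, not the production values.
import docker

client = docker.from_env()

def run_notebook_container(image: str, user: str, interactive: bool):
    # Interactive sessions and periodic jobs land in different systemd slices,
    # so one runaway notebook cannot take the whole machine down.
    cgroup_parent = "jupyter-interactive.slice" if interactive else "jupyter-batch.slice"
    return client.containers.run(
        image,
        name=f"jupyter-{user}",
        detach=True,
        cgroup_parent=cgroup_parent,   # systemd slice acting as the cgroup parent
        mem_limit="8g",                # hard memory cap per container
        nano_cpus=2_000_000_000,       # roughly two CPUs
    )
```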
Iterating on the system
Pretty soon it turned out that the first big problem with adoption was Git. My preferred solution was to teach people more about it - I don't believe you can escape the plumbing. After ensuring people had some basic knowledge, we experimented with a couple of UIs to ease the most common workflows - after some tweaks, https://github.com/jupyterlab/jupyterlab-git became a permanent feature of the new system.
The next challenge was writing the JSON definitions - Jupyter offers no real editor for these kinds of files, and writing them in a plain text editor is quite error-prone. The best you can do is copy-paste some other definition and adjust parameters, hoping you didn't leave a trailing comma. We experimented with several approaches:
- we defined a schema using https://json-schema.org to allow validation (see the sketch after this list),
- using that, we created a UI with https://github.com/json-editor/json-editor,
- since VSCode has good support for JSON editing, we also added https://github.com/coder/code-server alongside JupyterLab,
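The validation piece is straightforward with the jsonschema library. The sketch below uses a hypothetical, simplified job definition schema - the real schema had different fields - and a helper name of my own invention.

```python
# Minimal sketch of validating a job definition against a JSON Schema.
# The schema and field names are simplified and hypothetical.
import json
from jsonschema import validate, ValidationError

JOB_SCHEMA = {
    "type": "object",
    "required": ["name", "schedule", "notebook"],
    "properties": {
        "name": {"type": "string", "pattern": "^[a-z0-9-]+$"},
        "schedule": {"type": "string"},        # cron expression
        "notebook": {"type": "string"},        # path to the jupytext Markdown file
        "timeout_minutes": {"type": "integer", "minimum": 1},
    },
    "additionalProperties": False,
}

def check_definition(path: str) -> None:
    with open(path) as f:
        definition = json.load(f)  # also catches trailing commas and other JSON slips
    try:
        validate(instance=definition, schema=JOB_SCHEMA)
    except ValidationError as err:
        raise SystemExit(f"{path}: {err.message}")
```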
None of these approaches proved satisfactory though - I think in the end the latter two were scrapped entirely and just the validation remained.
Unsurprisingly, the elephant in the room was the migration of old jobs. Even though we were able to automatically migrate certain kinds of simpler jobs, the breaking changes in packages forced analytics teams to re-check and fix every notebook. This made the migration much longer than necessary - over a year long. In hindsight, it might have been better to provide them with some sort of "transitory" setup, where old definitions were more likely to work. But that, in turn, would have prolonged the "proper" migration, until everything was on the rolling setup.