Build a platform for ingesting tweets

During my time at Tooploox, one of our longest contracts was with NowThis. What started as a two-man gig building a small block of the overall platform grew into dozens of people working in several teams on two platforms over a multi-year partnership.
Humble beginnings of a new platform
When publishing to multiple outlets, knowing the results is crucial. As such, it was always a given that at some point we’d build the part of the platform that deals with analytics - telling people to sort separately through the data provided by each publishing platform (Twitter, Facebook, RSS) was only an interim solution. We simply lacked the manpower to make it happen.
It all changed when one of the prototypes created by the Data Science team became a hit. What had initially been one or two auxiliary engineers reviewing code written by scientists was bumped to four people tasked with building a platform to deploy that prototype, create more like it, and make them reliable. The goal was to ingest real-time multi-platform data (tens of thousands of data points per second).
The prototype? It fetched HTML from Facebook pages every hour and saved the results to CSV files.
Lambda architecture
In 2016 the "Big Data" craze was just starting. Twitter was still using Postgres, Facebook was still on PHP, and the state of the art was OLAP cubes. After scouring the internet and the literature for ideas, the initial pitch for the platform became:
- Each data ingestion mechanism creates a data "stream" with a homogeneous data structure,
- "Streams" are persisted to a central location and are queryable only in relation to each other (e.g. "Give me the next 100 events after UUID XX…"),
- Downstream processing is organized around those streams, with microservices having their own storage but deriving their data only from streams in the central location (a rough sketch of this contract follows below).
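To make that contract concrete, here is a minimal sketch - the envelope fields and the Store interface are my assumptions for illustration, not the actual schema we used, but they capture the two rules above: every event is wrapped in the same homogeneous structure, and reads are only expressed relative to other events.

```go
package stream

import (
	"encoding/json"
	"time"
)

// Event is a hypothetical homogeneous envelope: every ingestion
// mechanism (Twitter, Facebook, RSS) wraps its platform-specific
// payload in the same structure before it reaches the central store.
type Event struct {
	ID        string          `json:"id"`        // time-ordered UUID assigned on write
	Stream    string          `json:"stream"`    // e.g. "facebook_page_stats"
	Timestamp time.Time       `json:"timestamp"` // when the data point was observed
	Payload   json.RawMessage `json:"payload"`   // raw platform data, opaque to the store
}

// Store is the central location's contract: appends from ingestion,
// and cursor-based reads ("the next N events after UUID X") for
// everything downstream.
type Store interface {
	Append(stream string, events []Event) error
	ReadAfter(stream string, afterID string, limit int) ([]Event, error)
}
```

Everything downstream only ever pages through a read like ReadAfter, which is what the improvements below hinge on.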
It’s not that far from the prototype if you squint hard enough - one service takes the place of the CSV files, with a very similar structure, and almost everything else stays the same. The improvements were:
- only one service that had to scale (the warehouse),
- data could easily be re-processed (in case of errors, new prototypes, or migrations) - see the consumer sketch after this list,
- extensibility - e.g. adding a real-time processing pipeline would only require pushing data from the central location to some queue.
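To show why re-processing and extensibility come almost for free under this design, here is a rough sketch (assumed names and signatures, not production code) of what a downstream consumer amounts to - a cursor loop over the central store. Replaying history and feeding a queue are the same loop with a different process function.

```go
package pipeline

import (
	"fmt"
	"time"
)

// RawEvent is an event from the central store plus its cursor ID.
type RawEvent struct {
	ID      string
	Payload []byte
}

// Reader is a stand-in for the persistence read API client.
type Reader interface {
	ReadAfter(stream string, afterID string, limit int) ([]RawEvent, error)
}

// Tail walks a stream from the given cursor and hands every event to
// process. Starting from an empty cursor re-processes all history;
// making process push to a queue gives a near-real-time pipeline.
func Tail(r Reader, stream, cursor string, process func(RawEvent) error) error {
	for {
		batch, err := r.ReadAfter(stream, cursor, 100)
		if err != nil {
			return err
		}
		if len(batch) == 0 {
			time.Sleep(5 * time.Second) // caught up with the stream; poll again
			continue
		}
		for _, ev := range batch {
			if err := process(ev); err != nil {
				return fmt.Errorf("processing event %s: %w", ev.ID, err)
			}
			cursor = ev.ID // advance only after the event is safely handled
		}
	}
}
```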
Getting there
The biggest question was whether we could pull it off - the main focus at that point was still the publishing platform, so the three engineers (with me in charge) had limited support from Ops and not much experience with this kind of system - all of us had previously worked mainly on web projects.
We used that to our advantage. After benchmarking several storage solutions (the Jepsen series of articles was popular at the time), we decided to base storage on Cassandra. It had a simple data model, an SQL-like DSL, and a deployment straightforward enough to get buy-in from Ops. After settling on the data model, all we had to write was an HTTP API for writes (from ingestion) and reads (from products). We wrote it in Ruby and named the microservice simply "persistence".
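To illustrate why the stream model mapped so cleanly onto Cassandra: a stream becomes a partition and the event UUID a clustering column, so "the next 100 events after UUID X" is a single range scan. The table and names below are invented, and the snippet uses the gocql driver (so it reflects the later Go incarnation of the service rather than the original Ruby one).

```go
package main

import (
	"fmt"
	"log"

	"github.com/gocql/gocql"
)

// Assumed schema, not the production one:
//   CREATE TABLE events (
//     stream  text,
//     id      timeuuid,
//     payload blob,
//     PRIMARY KEY (stream, id)
//   ) WITH CLUSTERING ORDER BY (id ASC);

func main() {
	cluster := gocql.NewCluster("127.0.0.1")
	cluster.Keyspace = "insights" // hypothetical keyspace name
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// "Give me the next 100 events after UUID X" becomes a range scan
	// within a single partition.
	after := gocql.TimeUUID() // placeholder; in practice the cursor supplied by the caller
	iter := session.Query(
		`SELECT id, payload FROM events WHERE stream = ? AND id > ? LIMIT 100`,
		"facebook_page_stats", after,
	).Iter()

	var id gocql.UUID
	var payload []byte
	for iter.Scan(&id, &payload) {
		fmt.Printf("event %s: %d bytes\n", id, len(payload))
	}
	if err := iter.Close(); err != nil {
		log.Fatal(err)
	}
}
```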
Ingestion didn’t change much initially - instead of writing to CSV, it wrote to our persistence API. The "prototype" was more of a challenge - it needed cleaned and smoothed-out data and had an exotic query model: a statistical model that had been kept fully in memory, but had by that point grown too large to continue that way.
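The CSV-to-API swap on the ingestion side was essentially "POST the same rows over HTTP". A hedged sketch of what a scraper’s write path might have looked like - the endpoint shape and field names are assumptions:

```go
package ingest

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// Point is one scraped data point; previously a CSV row.
type Point struct {
	Stream    string      `json:"stream"`
	Timestamp time.Time   `json:"timestamp"`
	Payload   interface{} `json:"payload"`
}

// Flush replaces "append to CSV": it sends a scraped batch to the
// persistence write API. The /streams/{name}/events path is illustrative.
func Flush(apiURL, stream string, batch []Point) error {
	body, err := json.Marshal(batch)
	if err != nil {
		return err
	}
	resp, err := http.Post(
		fmt.Sprintf("%s/streams/%s/events", apiURL, stream),
		"application/json",
		bytes.NewReader(body),
	)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode >= 300 {
		return fmt.Errorf("persistence API returned %s", resp.Status)
	}
	return nil
}
```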
Someone suggested a fledgling company called InfluxDB, whose product fit our use case almost perfectly - and even promised features that could reduce the amount of data preparation needed. In short, ideally we could just load it up with our "raw" data and it would prepare a queryable model for us on its own. It seemed perfect. We took the offer, asked Ops to deploy it on our servers, and completed the transition.
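For context, loading "raw" data into the InfluxDB of that era was a single HTTP call using its line protocol; the measurement, tags, and database name below are made up, but the /write endpoint and the format are what the 0.x/1.x API exposed.

```go
package main

import (
	"log"
	"net/http"
	"strings"
)

func main() {
	// One point per line: measurement,tags fields timestamp(ns).
	lines := strings.Join([]string{
		"engagement,platform=facebook,post_id=123 likes=42i,shares=7i 1464700000000000000",
		"engagement,platform=twitter,post_id=456 likes=9i,shares=2i 1464700000000000000",
	}, "\n")

	// The database name "insights" is illustrative.
	resp, err := http.Post(
		"http://localhost:8086/write?db=insights",
		"text/plain",
		strings.NewReader(lines),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Println("influx write status:", resp.Status)
}
```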
The last missing piece was deployment - thankfully, we could piggyback on the work done for the main platform, which used Nomad to orchestrate Docker containers across multiple servers. Initially we used just three, as this was the minimum number needed to achieve resiliency for the real-time data.
The Insights platform
The productized prototype was enough of a success to secure more work on this side of things. The team was scaled to four, and later six, full-time engineers. The larger team (with me becoming Tech Lead of that platform) delivered more integrations with social platforms (including webhook-based ingestion from Facebook) and more user-facing products and prototypes.
The biggest challenges at that point were scaling the persistence API and keeping Influx running. The former was caused by the language (Ruby) and implementation quirks which meant the service didn’t scale horizontally. We solved it by re-implementing the API in Golang, benchmarking it thoroughly, migrating all historical data to the new service (roughly as sketched below), and then switching all readers over. The latter problem was inherent to Influx - at that point it needed all data in memory for queries and had no mechanism to stop data from overflowing memory (and getting killed due to OOM). We fixed it by moving away from Influx and leaning more on Redis, Postgres, and Cassandra as storage mechanisms.
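The historical migration itself was conceptually just "page through the old service, write to the new one, then flip the readers". A rough sketch under assumed endpoint shapes (including a hypothetical X-Next-Cursor header):

```go
package migrate

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Backfill copies one stream from the old Ruby persistence service to
// the new Go one by walking the cursor-based read API page by page.
func Backfill(oldAPI, newAPI, stream string) error {
	cursor := "" // empty cursor = start from the beginning of the stream
	for {
		page, next, err := readPage(oldAPI, stream, cursor)
		if err != nil {
			return err
		}
		resp, err := http.Post(
			fmt.Sprintf("%s/streams/%s/events", newAPI, stream),
			"application/json",
			bytes.NewReader(page),
		)
		if err != nil {
			return err
		}
		resp.Body.Close()
		if resp.StatusCode >= 300 {
			return fmt.Errorf("write to new API failed: %s", resp.Status)
		}
		if next == "" {
			return nil // no more pages: history is fully copied
		}
		cursor = next
	}
}

// readPage fetches one page of events after the given cursor from the
// old read API; the X-Next-Cursor header is a hypothetical detail.
func readPage(api, stream, cursor string) ([]byte, string, error) {
	url := fmt.Sprintf("%s/streams/%s/events?after=%s&limit=1000", api, stream, cursor)
	resp, err := http.Get(url)
	if err != nil {
		return nil, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, "", err
	}
	return body, resp.Header.Get("X-Next-Cursor"), nil
}
```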
The last integration delivered by this team was a visualization of Facebook ad engagement post-publishing in the main product.
Snake eats its tail
In a rather ironic turn of events, the main Publishing product turned out not to be a hit and was discontinued - unlike the Insights platform, which kept providing consistent value. This resulted in massive scaling - from one to three engineering teams, and from three to six and eventually nine bare-metal servers. My work shifted from architecture & backend to infrastructure & specific product integrations. In the end, the platform stored around 30TB of data covering 6+ years of publishing history, ingested data at a rate of 30,000-50,000 "events" per second, provided products with updates at one-minute latency, and maintained 99.87% uptime. I had the pleasure of filling several roles - Tech Lead, Developer, Operations - and of working with several fun technologies; aside from the ones already mentioned: DataDog, Chef, Apache Spark and Zeppelin, and Elixir. As part of a small team, I got hands-on experience with each of them, not just the occasional code review.