eero at Scale: building distributed and highly reliable systems with Akka

Before we even started to develop eero, we knew that the single-router model wasn’t the future; it doesn’t work in the workplace or on college campuses, and it isn’t going to work in your home, where a growing army of connected devices all compete for a valuable resource: your WiFi. To provide reliable coverage and consistent performance across large homes and WiFi-crowded urban environments, we chose to build a distributed mesh platform.

This distributed model only solves half of the problem, though. On the backend of eero, we face a separate challenge: how do we build a highly available, high-performance infrastructure that’s able to communicate with each eero device? To do this correctly, it was important for us to choose an architecture that could scale up and out without having to constantly rebuild it.

We knew that the web architectures we had worked with in the past wouldn’t make the cut. You know, the old three-tier, stateless HTTP, thread-per-request model. We needed a system that would make it easy to model our domain as it grew in complexity and that could handle concurrency at very high scale.

So we bet on Akka.

What is Akka?

Akka is a library for writing distributed, concurrent, fault-tolerant systems on top of the actor model. If you’ve never heard of the actor model, it can be a little difficult to conceptualize at first, so we’ll start with the basics. An actor is a thing that you can interact with by sending messages to it. In response to a message, an actor can take the following actions:

  1. Send a finite number of messages to other actors
  2. Create a finite number of new actors
  3. Change its internal state for the next message it receives
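
If you’ve never seen an actor in code, here is a minimal Scala sketch using Akka’s classic actor API that exercises all three of those capabilities. The `Greeter`, `GreetingLogger`, and message names are our own illustration rather than anything from Akka itself; the Akka pieces are `Actor`, `Props`, `actorOf`, the `!` (tell) operator, and `context.become`.

```scala
import akka.actor.{Actor, ActorSystem, Props}

// Illustrative message protocol; the names are ours, not Akka's.
final case class Greet(name: String)
final case class Greeting(text: String)
case object Promote

class GreetingLogger extends Actor {
  def receive: Receive = {
    case Greeting(text) => println(text)
  }
}

class Greeter extends Actor {
  // (2) Create new actors.
  private val logger = context.actorOf(Props[GreetingLogger](), "logger")

  def receive: Receive = casual

  private def casual: Receive = {
    case Greet(name) =>
      // (1) Send messages to other actors.
      logger ! Greeting(s"hey $name")
    case Promote =>
      // (3) Change behavior/state for the next message it receives.
      context.become(formal)
  }

  private def formal: Receive = {
    case Greet(name) => logger ! Greeting(s"Good day, $name.")
  }
}

object ActorBasics extends App {
  val system  = ActorSystem("demo")
  val greeter = system.actorOf(Props[Greeter](), "greeter")

  greeter ! Greet("eero") // prints "hey eero"
  greeter ! Promote
  greeter ! Greet("eero") // prints "Good day, eero."
  // Shutdown is omitted for brevity; call system.terminate() when you're done.
}
```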

Akka’s implementation is incredibly efficient: it can process upwards of 50 million messages per second and run millions of actors on a single host. Scaling out is made easier by Akka Cluster, which lets you run actors and deliver messages across a network of hosts.
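
The details of how actors get spread across machines are beyond the scope of this post, but as a hint of what Akka Cluster enables, here is a hedged sketch of Cluster Sharding with the classic API: each entity is addressed by an ID, and the cluster decides which node hosts its actor. The `Envelope` and `DeviceActor` names and the shard count are assumptions for illustration, and you would also need the `akka-cluster-sharding` module plus cluster settings (provider, seed nodes) in your configuration.

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}
import akka.cluster.sharding.{ClusterSharding, ClusterShardingSettings, ShardRegion}

// Hypothetical envelope: the device id doubles as the sharding entity id.
final case class Envelope(deviceId: String, payload: Any)

// Placeholder entity actor; real device actors would be far richer.
class DeviceActor extends Actor {
  def receive: Receive = {
    case _ => () // handle device messages here
  }
}

object DeviceSharding {
  private val extractEntityId: ShardRegion.ExtractEntityId = {
    case Envelope(id, payload) => (id, payload)
  }

  private val extractShardId: ShardRegion.ExtractShardId = {
    case Envelope(id, _) => (math.abs(id.hashCode) % 100).toString // 100 shards, arbitrary
  }

  // Returns a shard region ActorRef; messages sent to it are routed to
  // whichever cluster node currently hosts that device's actor.
  def start(system: ActorSystem): ActorRef =
    ClusterSharding(system).start(
      typeName        = "device",
      entityProps     = Props[DeviceActor](),
      settings        = ClusterShardingSettings(system),
      extractEntityId = extractEntityId,
      extractShardId  = extractShardId
    )
}

// Usage: val region = DeviceSharding.start(system)
//        region ! Envelope("device-123", "heartbeat")
```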

What’s wrong with traditional web architectures?

A typical web service request looks like this: data is loaded from a database and processed according to the API being called, new data is written back to the database, and then a response is returned. With modern web frameworks, implementing these kinds of interactions has become increasingly simple. Between libraries that route HTTP requests to view functions and ORMs for easy database access, you’re only a few code snippets away from implementing your own personal blogging platform.

Where so many of these frameworks fail is in dealing with concurrency. Idempotency can be difficult when someone gets click-happy with a web form. Bespoke distributed locking mechanisms and centralized control systems are used to prevent worker processes from stepping on each other’s state. Add more concurrent requests, long-running workers, and massively parallelized jobs, and things become very complex, very quickly. And as things scale, the introduction of performance optimizations, like caching, complicates matters further.

The shared memory problem

Perhaps the biggest problem of traditional web architectures is that the database becomes the shared memory in a vastly concurrent system. One of the things you learn early on in multi-threaded programming is that shared memory introduces a lot of complexity. Primitives such as locks, semaphores, and mutexes are employed to guarantee consistency across concurrent threads. In typical web services, we rarely attempt to coordinate access to data. With the popularity of powerful ORMs and MVC frameworks, it becomes easy to fetch the data you need from the database in order to service a request. If you need to guarantee consistency, you’re on your own.

As systems become more distributed with multiple request servers, async workers, caches, etc., there’s an increased likelihood that different parts of your system have different representations of the same piece of data. As data moves throughout your systems, consistency is harder and harder to maintain.

Akka to the rescue

At eero, we model our domain objects and processes using Akka actors. Instead of moving data to code, Akka lets us move code to data. The single source of truth for a piece of data, then, is the actor that owns it — not a row in a database table. When a decision needs to be made for an eero, network, or any other object, that happens in its actor, always on the same machine (unless the actor gets migrated) and always with the same piece of memory.

Let’s look at an example of how we use Akka as part of a simple feature that would actually be somewhat complicated at scale: online status of an eero.

In the eero mobile app, customers can see whether or not each of their eeros is online. The decision about whether an eero is online is based on the last time we heard from it. In the traditional web service model, this is pretty simple. We would have a row or document for each eero in our favorite data store. When the mobile app requests the status of the customer’s network, we could read all of the network’s eeros from the data store, check their last communication time, and then set their status in the response. On the other side of the service, whenever an eero communicates with our service, we’d update its row or document to reflect the last communication time. Very simple, even without a fancy ORM.
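
To make that concrete, here is a rough sketch of the traditional approach using plain JDBC against a hypothetical `eeros` table with `id` and `last_seen` columns; the schema and helper names are purely illustrative.

```scala
import java.sql.{Connection, Timestamp}
import java.time.{Duration, Instant}

// Hypothetical schema: eeros(id TEXT PRIMARY KEY, last_seen TIMESTAMP)
object TraditionalStatus {

  // Called every time an eero talks to the service.
  def markSeen(conn: Connection, eeroId: String): Unit = {
    val stmt = conn.prepareStatement("UPDATE eeros SET last_seen = ? WHERE id = ?")
    try {
      stmt.setTimestamp(1, Timestamp.from(Instant.now()))
      stmt.setString(2, eeroId)
      stmt.executeUpdate()
    } finally stmt.close()
  }

  // Called when the mobile app asks whether an eero is online.
  def isOnline(conn: Connection, eeroId: String, threshold: Duration): Boolean = {
    val stmt = conn.prepareStatement("SELECT last_seen FROM eeros WHERE id = ?")
    try {
      stmt.setString(1, eeroId)
      val rs = stmt.executeQuery()
      rs.next() && Option(rs.getTimestamp("last_seen")).exists { ts =>
        ts.toInstant.isAfter(Instant.now().minus(threshold))
      }
    } finally stmt.close()
  }
}
```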

However, things get complicated once you start to track the statuses of more and more eeros. A hundred thousand eeros could mean tens of thousands of last-communication-time updates per second. To solve this, you might use a different data store that’s optimized for high-throughput writes, or you might shard your data. Either way, you’re introducing complexity into your ORM and system operations. Suddenly a simple feature is consuming a lot of CPU cycles at your data store layer, and you still haven’t really solved the problem; you’ve only mitigated it. The fundamental problem is this: the traditional model encourages pulling data to the decision, instead of pushing the decision to the data.

With eero, the last communication time for a device is tracked by an actor. When a device communicates with our service, that actor updates its internal state accordingly. When we need to know the state of a customer’s network, the actor for each device can be queried for the device status. Internally, the device actor can decide when and how to persist its state. Since last communication time is no longer needed across the system, it becomes an implementation detail of “device status”. This frees the actor up to store last communication time in memory and only persist changes to device status to the data store. Presumably, this is a database update that happens much less frequently than every time the device talks to the service.
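
Here is a rough sketch of what such a device actor could look like with classic Akka actors. The message protocol, the two-minute threshold, and the `persistStatus` hook are illustrative assumptions rather than eero’s actual code; the point is that heartbeats only touch memory, while only online/offline transitions reach the data store.

```scala
import java.time.Instant
import scala.concurrent.duration._
import akka.actor.{Actor, Cancellable, Props}

object DeviceStatusActor {
  // Illustrative protocol, not eero's actual messages.
  case object Heartbeat                 // the device talked to the service
  case object GetStatus                 // the mobile app wants the status
  final case class StatusResponse(online: Boolean)
  case object CheckLiveness             // internal timer tick

  def props(deviceId: String): Props = Props(new DeviceStatusActor(deviceId))
}

class DeviceStatusActor(deviceId: String) extends Actor {
  import DeviceStatusActor._
  import context.dispatcher

  // Assumed threshold: silent for more than two minutes means offline.
  private val offlineAfter = 2.minutes

  // Last communication time lives only in this actor's memory.
  private var lastSeen: Instant = Instant.EPOCH
  private var online: Boolean   = false

  private val tick: Cancellable =
    context.system.scheduler.schedule(30.seconds, 30.seconds, self, CheckLiveness)

  override def postStop(): Unit = tick.cancel()

  def receive: Receive = {
    case Heartbeat =>
      lastSeen = Instant.now()
      if (!online) transition(nowOnline = true) // only a *transition* hits the database

    case CheckLiveness =>
      val silentMillis = Instant.now().toEpochMilli - lastSeen.toEpochMilli
      if (online && silentMillis > offlineAfter.toMillis) transition(nowOnline = false)

    case GetStatus =>
      sender() ! StatusResponse(online)
  }

  private def transition(nowOnline: Boolean): Unit = {
    online = nowOnline
    persistStatus(deviceId, nowOnline)
  }

  // Hypothetical persistence hook: an infrequent write of the status change
  // instead of a write per heartbeat.
  private def persistStatus(deviceId: String, online: Boolean): Unit = {
    // e.g. fire off an async update to the data store; omitted in this sketch
  }
}
```

In a real deployment, actors like this would typically sit behind Cluster Sharding, as sketched earlier, and be queried from the API layer with Akka’s ask pattern; the sketch above only shows the in-memory bookkeeping.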

If you’re interested in writing highly scalable distributed systems using Akka and Scala, eero is hiring.