Bookshelf: A Spiffy Space for Stashing State

Unlike many other WiFi products, eero is built from the ground up. From our own software, including TrueMesh, to every component that goes into each eero, there is no other home WiFi experience quite like eero.

As a system, eeros work together as a single network, so no matter how many eeros you have or where you are in your home, your experience is always the same. For this to work, all the eeros in your network need to know the latest configuration. For example, when you change your network’s name (SSID), every eero needs to get the memo and update the SSID it broadcasts.

In the past, eero software has relied on our cloud services to coordinate changes to network configuration across a network. Launched as part of our latest release (v3.10), a new software component that runs within every network is in charge of keeping eeros in sync. We call it Bookshelf, and we’re excited to tell you about it.

Preface

Since we first shipped, every eero ran a precursor to Bookshelf. We called it the Parameter Server. It was a lightweight, in-memory database through which the software components on a single unit could communicate. It allowed us to develop loosely coupled components quickly since they each could use the Parameter Server as a common way to share and get access to that unit’s state.

There were drawbacks to the Parameter Server that, over time, began to slow our ability to innovate. Its API and data encoding were fundamentally tied to the Python programming language, in which many of eero’s management daemons were originally written. As we moved components to more memory-efficient languages, the rewritten versions still needed access to the state stored in the Parameter Server in order to work with the rest of eeroOS.

Around the same time, a team of eero engineers was brainstorming the next generation of features to build into our WiFi systems. Quickly, the discussion turned to what software infrastructure we would need to realize those features. These conversations crystallized the need for developer-friendly tools for communication within the LAN. Developers found the Parameter Server valuable and familiar but needed ways to change state on one unit and have that state become visible throughout the eero units in the network.

With these requirements in hand, a small team at eero started prototyping a new, replicated state server that would become Bookshelf.

Chapter 1: Get GOing

Surveying the landscape of memory-efficient systems programming languages with vibrant ecosystems and high developer productivity led us to evaluate the use of Golang. After successfully dipping our toes in the waters of Go with an earlier project, we began to feel more comfortable and decided to use Go for implementing Bookshelf.

Go’s rich standard library and ecosystem of networking and server packages allowed us to quickly prototype our ideas. The high quality of these packages and Go’s powerful concurrency primitives allowed us to productionize this prototype while keeping to our schedule. One feature of Go that offers a leg up over our previous Python development is the ability to use built-in production monitoring tools, like pprof, to diagnose bottlenecks and deadlocks.

Chapter 2: As Simple as Possible, but no Simpler

One of the biggest challenges we had in the early days of developing Bookshelf was deciding what features it would support and how data would be stored and organized within it. Debates sprang up between keeping Bookshelf simple and adding capabilities that would make it easier for developers to implement user-visible features.

The environment Bookshelf runs in is very different from that of the many replicated or distributed data stores that were born in the datacenter. Although there are many similarities (eero units are in some ways smaller versions of the Linux servers that comprise the tech giants’ clouds), there are differences that influenced our design constraints.

The first is the size of networks. When you run your own replicated service in a datacenter, you can choose the best number of hosts on which to run that service.

For example, Apache ZooKeeper might recommend you run it on three or five hosts; fewer and the service won’t be available when there are hardware failures, more and the overhead of coordination slows the service. However, at eero, we do not have control over the number of units in a network. There are networks with a single eero unit, others with two units (a particularly pernicious number for distributed systems), and a handful of networks with more than twenty! Given this range, we needed a system that would tolerate all sorts of combinations of failures and partitions.

The second constraint is around maintainability. Datacenter software has become smarter and more self-regulating over time but still can rely on teams of DevOps or site reliability engineers to troubleshoot when alerts go off. For our users’ privacy, no one can SSH into eero networks, poke around, edit configurations, restart services, or delete corrupted files. Bookshelf needs to be able to recover from as many failures as we can imagine (and some we have not yet), so our users’ networks work all the time.

Being mindful of these constraints and the small team on the project, we focused on simplicity. Bookshelf is replicated, but has no partitioning or sharding; every instance has a copy of the full state of the replicated database. This removed the need for either a centralized metadata service or distributed hash table to locate where data was stored. It also ensured all data would be available locally. Beyond performance benefits, this locality is crucial since some of the state in Bookshelf is required to establish the network’s mesh links over which Bookshelf instances communicate.

A second key concession to simplicity was to bias the design heavily towards availability over consistency (in the sense of the CAP theorem). Past experience has taught members of the team that consistency (particularly for the write path) can simplify client logic by offering behavior more like a traditional, non-distributed data store. However, in light of the variously sized networks Bookshelf would have to run in, a system that would refuse to accept updates when one half of a two-unit network was unavailable seemed unacceptable from a user experience perspective.
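To make the two-unit problem concrete, here is a small sketch of the arithmetic behind a majority-quorum scheme, the kind of consistency-first design Bookshelf chose not to use. The function names are illustrative only and are not part of Bookshelf.

```go
package main

import "fmt"

// Quorum returns the smallest majority of n replicas.
func Quorum(n int) int {
	return n/2 + 1
}

// TolerableFailures returns how many replicas may be unreachable
// while a majority of the n replicas can still be assembled.
func TolerableFailures(n int) int {
	return n - Quorum(n)
}

func main() {
	for _, n := range []int{1, 2, 3, 5} {
		fmt.Printf("units=%d quorum=%d tolerable failures=%d\n",
			n, Quorum(n), TolerableFailures(n))
	}
}
```

With two units the quorum is also two, so a majority-based store would stop accepting writes as soon as either unit became unreachable. Adding a second unit would buy no fault tolerance over a single one.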

So, when partitions happen, and not all units in a network can talk with each other, Bookshelf continues to allow updates to state, and our smart management software can change settings to get the network running again. When a client application writes a value to a Bookshelf instance, that instance will update its local copy and forward the new value to all peers that are still reachable.
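The write path described above can be sketched roughly as follows. This is a minimal illustration under our own assumptions, not Bookshelf’s actual code: the Peer interface, the memPeer stand-in, and the key name are all hypothetical, and the real system talks to peers over the network rather than in memory.

```go
package main

import (
	"errors"
	"fmt"
)

// Peer is a hypothetical handle to another instance in the network.
type Peer interface {
	Send(key, value string) error
}

// memPeer is an in-memory stand-in for a remote peer (illustration only).
type memPeer struct {
	reachable bool
	data      map[string]string
}

func (p *memPeer) Send(key, value string) error {
	if !p.reachable {
		return errors.New("peer unreachable")
	}
	p.data[key] = value
	return nil
}

// Instance holds a full local copy of the state and a list of peers.
type Instance struct {
	data  map[string]string
	peers []Peer
}

// Put always applies the write locally, then forwards it on a
// best-effort basis. It returns how many peers accepted the update;
// unreachable peers are skipped and caught up later by reconciliation.
func (in *Instance) Put(key, value string) int {
	in.data[key] = value
	delivered := 0
	for _, p := range in.peers {
		if err := p.Send(key, value); err == nil {
			delivered++
		}
	}
	return delivered
}

func main() {
	up := &memPeer{reachable: true, data: map[string]string{}}
	down := &memPeer{reachable: false, data: map[string]string{}}
	in := &Instance{data: map[string]string{}, peers: []Peer{up, down}}
	fmt.Println("peers updated:", in.Put("/network/ssid", "Home"))
}
```

The important property is that Put never fails because a peer is missing: the local write succeeds regardless, which is the availability-over-consistency trade-off in miniature.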

Once the partitioned instances become reachable again, the peers will run a reconciliation protocol to synchronize their states. Each instance maintains a Merkle tree tracking its state, so peers can quickly determine whether they are in sync when they first connect. If they are not, the reconciliation protocol merges the two nodes’ states using a last-writer-wins policy.
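As a rough illustration of this reconciliation step, the sketch below compares a digest of each peer’s state and, on a mismatch, merges entry by entry with the newer write winning. It simplifies heavily: a single flat SHA-256 digest stands in for the Merkle tree (a real tree would also let peers narrow down which keys differ), and the Timestamp field and the tie-breaking rule are assumptions of ours, since this post does not specify how Bookshelf orders writes.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"sort"
)

// Entry pairs a value with a hypothetical write timestamp.
type Entry struct {
	Value     string
	Timestamp int64
}

// State is one instance's full copy of the data.
type State map[string]Entry

// Digest hashes every entry in sorted key order, so two states with
// identical contents always produce the same digest.
func Digest(s State) [32]byte {
	keys := make([]string, 0, len(s))
	for k := range s {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	h := sha256.New()
	for _, k := range keys {
		e := s[k]
		// Length prefixes keep distinct entries from colliding.
		fmt.Fprintf(h, "%d:%s|%d:%s|%d\n", len(k), k, len(e.Value), e.Value, e.Timestamp)
	}
	var out [32]byte
	copy(out[:], h.Sum(nil))
	return out
}

// newer reports whether a should replace b. Equal timestamps are
// broken by comparing values (an assumed rule) so both sides agree.
func newer(a, b Entry) bool {
	if a.Timestamp != b.Timestamp {
		return a.Timestamp > b.Timestamp
	}
	return a.Value > b.Value
}

// Reconcile merges a and b in place so that both end up identical.
// It returns false when the digests already matched.
func Reconcile(a, b State) bool {
	if Digest(a) == Digest(b) {
		return false
	}
	for k, eb := range b {
		if ea, ok := a[k]; !ok || newer(eb, ea) {
			a[k] = eb
		}
	}
	for k, ea := range a {
		if eb, ok := b[k]; !ok || newer(ea, eb) {
			b[k] = ea
		}
	}
	return true
}

func main() {
	a := State{"/network/ssid": {Value: "Home", Timestamp: 1}}
	b := State{"/network/ssid": {Value: "Cabin", Timestamp: 2}}
	fmt.Println("merged:", Reconcile(a, b), "ssid:", a["/network/ssid"].Value)
}
```

Last-writer-wins is simple to reason about, but it means a concurrent older write is silently discarded; that is part of why features needing coordination call for more careful design.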

In its current design, Bookshelf supports a key-value structure and the ability to subscribe to changes to one key or a range of keys by prefix. Data is organized hierarchically in an informal way to provide documentation, discoverability, and ease of subscribing to related state. Keys are structured like file system paths, but Bookshelf treats them only as strings; there is no first-class notion of a directory. There are no transactions or snapshots. These decisions allowed us to quickly implement a system whose behavior we could reason about in different scenarios.
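A minimal sketch of this data model might look like the following. The type and method names and the example key paths are hypothetical, chosen only to illustrate flat string keys with prefix subscriptions; they are not Bookshelf’s API. The sketch is single-goroutine and omits locking and replication.

```go
package main

import (
	"fmt"
	"strings"
)

// Update describes a single changed key.
type Update struct {
	Key   string
	Value string
}

// subscriber is notified of every update whose key starts with prefix.
type subscriber struct {
	prefix string
	notify func(Update)
}

// Store is a flat key-value map. Keys look like paths, but they are
// plain strings: there are no directories, only string prefixes.
type Store struct {
	data map[string]string
	subs []subscriber
}

func NewStore() *Store {
	return &Store{data: make(map[string]string)}
}

// Subscribe registers a callback for all keys sharing the given prefix.
func (s *Store) Subscribe(prefix string, notify func(Update)) {
	s.subs = append(s.subs, subscriber{prefix: prefix, notify: notify})
}

// Put stores the value and synchronously notifies matching subscribers.
func (s *Store) Put(key, value string) {
	s.data[key] = value
	u := Update{Key: key, Value: value}
	for _, sub := range s.subs {
		if strings.HasPrefix(key, sub.prefix) {
			sub.notify(u)
		}
	}
}

// Get returns the value for key and whether it was present.
func (s *Store) Get(key string) (string, bool) {
	v, ok := s.data[key]
	return v, ok
}

func main() {
	s := NewStore()
	s.Subscribe("/network/", func(u Update) {
		fmt.Printf("changed %s = %s\n", u.Key, u.Value)
	})
	s.Put("/network/ssid", "Home")      // matches the subscription
	s.Put("/clients/eero-1/count", "4") // does not match
}
```

Because matching is purely by string prefix, the path-like layout gives related state a common prefix to subscribe to without the store needing any notion of a directory.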

Another way we kept the design simple is by separating concerns and building on top of other technologies. One of these is the eero public key infrastructure described in an earlier post on this blog. Bookshelf stores sensitive network configuration state, such as networks’ passwords. The certificates managed by the eero PKI ensure that only authorized devices (eero units within the network and the network owner’s mobile phone) can access or modify Bookshelf’s data. After clients authenticate, those same certificates are used to establish TLS sessions to keep data confidential as it is sent across the encrypted LAN.

Lastly, using IPv6 allowed us to simplify peer discovery and network bootstrapping. Home networks are volatile places: some users are quick to reboot units to troubleshoot connectivity issues, others power off parts of their mesh during the night (either intentionally or because those units are wired to circuits turned off to save power), and new units are added to networks to expand coverage. Our cloud services provide each unit in a network with an up-to-date list of all the peer units. Each unit has an IPv6 link-local address, which enables Bookshelf to connect and synchronize as soon as the layer 2 meshing occurs, without having to wait for layer 3 IP networking components like DHCP to assign addresses.
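One practical detail of link-local addressing is that an fe80::/10 address is only meaningful relative to a specific interface, so the zone (interface name) has to travel with the address when dialing. The sketch below shows how a peer endpoint could be assembled with Go’s standard net/netip package. The PeerEndpoint helper, the interface name br0, and port 4000 are illustrative assumptions, not details of Bookshelf.

```go
package main

import (
	"fmt"
	"net/netip"
)

// PeerEndpoint builds a dialable "[addr%zone]:port" string for a peer's
// IPv6 link-local address. It rejects anything that is not link-local,
// since only those addresses exist before layer 3 configuration.
func PeerEndpoint(addr, iface string, port uint16) (string, error) {
	ip, err := netip.ParseAddr(addr)
	if err != nil {
		return "", err
	}
	if !ip.Is6() || !ip.IsLinkLocalUnicast() {
		return "", fmt.Errorf("%s is not an IPv6 link-local address", addr)
	}
	// The zone ties the address to the interface it is reachable on.
	return netip.AddrPortFrom(ip.WithZone(iface), port).String(), nil
}

func main() {
	ep, err := PeerEndpoint("fe80::1", "br0", 4000)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(ep)
}
```

A string in this form can be handed to net.Dial("tcp", ...) as-is, with no dependency on DHCP or any other address assignment having completed.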

Chapter 3: Where We Go Next

At eero, we will continue to build new software leveraging Bookshelf and, in doing so, learn more about its strengths and weaknesses. Educating developers who use Bookshelf has been key to avoiding the pitfalls and anti-patterns that come with its simple design. Features that use Bookshelf for publishing state to the other units on the network, such as lists of the clients connected to each eero, have found Bookshelf easy to use; others that require more coordination between the members of the network have required more careful design and planning to be successful. Some of the coordination patterns will likely be codified in libraries so they can easily be reused by other components.

Our long-term plan is to integrate Bookshelf with our cloud services. We want to reuse Bookshelf’s replication protocol to send updates to our cloud, where they can be cached in case users query state through our mobile application and so the cloud can act as a backup of networks’ configuration.

Furthermore, as Bookshelf and the other parts of eero’s smart network platform mature, we intend to move the implementation of some functionality from our cloud services to the eero units themselves. This local computation will keep more data about the health and state of our users’ networks inside their own networks, reducing what needs to be shared with eero servers.

If you found this post interesting and would like to learn more (and contribute to awesome projects like this), stop by our careers page.

Written by Matthew Richards, technical lead on the Connectivity Systems team.

Acknowledgments

Although we decided to build Bookshelf rather than use an existing tool, this was not done in a vacuum. The design was influenced by Apache Cassandra and the Ledger store in Google Fuchsia OS.  

Bookshelf relies on a number of open source projects, most notably Google’s gRPC and Ben Johnson’s and CoreOS’s BoltDB.