How Reviewable DoSed its own Firebase

And how you can avoid the same fate

By: Piotr Kaminski
Published: Tuesday, September 22, 2015

Reviewable is built on top of Firebase and had a rough start to September, with the whole app dropping out for minutes at a time. I've finally gotten to the bottom of the issues and fixed them, but thought it best to document everything for the benefit of other developers.

[Illustration: bunny shark]

If your app uses Firebase, did you know that chances are anyone can knock your app offline, even if you've got extensive security rules set up? If not, then read on for the gory details. One caveat first, though: everything in this post is accurate to the best of my knowledge as of this writing—and many details were cross-checked with the friendly Firebase team—but it's all subject to change in the near future. Watch this blog for updates.

Mysterious server outages

Reviewable's backend runs on a small group of NodeJS servers that continuously lease tasks from queues held in Firebase. One of the many instrumented metrics, checked every minute, is the maximum queue latency: the amount of time it takes for a trivial "ping" task to be picked up and processed from each queue. The normal latency is around 115ms, and I consider occasional spikes of 1 to 2 seconds acceptable, but near the end of August I noticed a sudden rash of delays in the 1 to 2 minute range. Not good.
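To make the setup concrete, here's a rough sketch of what such a latency probe could look like with the Firebase JavaScript SDK of that era. The queue path, the task shape, and the assumption that a worker deletes a task once it has been processed are all illustrative, not Reviewable's actual implementation.

```js
// Hypothetical latency probe: push a "ping" task into a queue and measure how
// long it takes a worker to consume (i.e. delete) it.  Paths and fields are
// invented for illustration.
var Firebase = require('firebase');

var queue = new Firebase('https://example-app.firebaseio.com/queues/tasks');

function probeQueueLatency() {
  var start = Date.now();
  var taskRef = queue.push({type: 'ping', sentAt: start});

  taskRef.on('value', function(snapshot) {
    if (snapshot.val() !== null) return;  // still waiting for a worker
    taskRef.off('value');
    console.log('queue latency:', Date.now() - start, 'ms');
  });
}

setInterval(probeQueueLatency, 60 * 1000);  // check every minute
```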

[Illustration: rooting around inside a server]

It took me a while to correlate the outages with other data and test a bunch of hypotheses, initially centered on possible bugs in Firelease. Eventually, I added extra checks for Firebase latency and came to the conclusion that, during the outages, no messages were flowing from the Firebase server even though the connection reported itself as healthy. I gathered up some logs and contacted the consistently awesome Firebase support folks, figuring this was some new kind of bug in their servers or the SDK. I promptly received a reply: working as expected.¹ Um... wat?

The needs of the many

It turns out that starting around mid-July Firebase rolled out a "throttling" feature to mitigate the impact of projects with bursts of heavy usage on their colocated neighbors. Before this change, certain kinds of traffic on one datastore could slow down all the other datastores hosted on the same server, so it certainly made sense to protect the many at the expense of one "misbehaving" project.

However, in my opinion they made a few implementation mistakes:

  1. The so-called "throttling" doesn't just isolate a single guilty connection or slow down overall processing for one datastore. Instead, it completely shuts down all communications for all clients of the datastore for up to a few minutes.

  2. There's no developer-visible notification that a throttling event has occurred. The events are tracked internally at Firebase, but...

  3. There's no monitoring set up to reach out to projects that are being throttled too often and help them remedy the situation. At peak times, Reviewable was being throttled multiple times per hour!

  4. There was no announcement of the feature's launch, nor any documentation that I could find about its functionality or existence.

This adds up to potentially app-crippling mini-outages that are very hard for a developer to figure out.

Defense against the dark writes

What triggers the throttling? It's tricky to specify exactly, but the short answer is large writes to your datastore—especially "wide" ones with a lot of children at the same level of the JSON hierarchy. To stay out of trouble, you must avoid such large, wide writes or split them up into smaller pieces.
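For example, instead of writing thousands of siblings in one shot, you can break the payload into modest batches. Here's a rough sketch with the Firebase JavaScript SDK; the chunk size is an arbitrary guess on my part, not a documented threshold.

```js
// Illustrative sketch: write a large flat object in smaller sequential batches
// instead of one wide update.  The 200-item chunk size is arbitrary.
var Firebase = require('firebase');
var ref = new Firebase('https://example-app.firebaseio.com/items');

function writeInChunks(items, chunkSize, done) {
  var keys = Object.keys(items);
  var index = 0;

  function writeNext() {
    if (index >= keys.length) return done();
    var chunk = {};
    keys.slice(index, index + chunkSize).forEach(function(key) {
      chunk[key] = items[key];
    });
    index += chunkSize;
    ref.update(chunk, function(error) {
      if (error) return done(error);
      writeNext();
    });
  }

  writeNext();
}

// usage: writeInChunks(bigFlatObject, 200, function(error) { /* ... */ });
```

Sequencing the batches, rather than firing them all at once, also spreads the load out over time instead of producing one big burst.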

In Reviewable's case, I originally set up GitHub webhooks to be posted directly into a datastore queue for convenience, but some of the larger push events weighed in at 10MB or more, with thousands and thousands of children. Since Reviewable doesn't actually care about most of the details of each event, it was a fairly simple matter to direct the webhooks to my own servers instead and strip the events down before enqueuing them into Firebase.
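The fix can be as simple as a small Express endpoint that accepts the webhook, keeps only the handful of fields the app actually needs, and enqueues the slimmed-down event. The field selection below is invented for illustration; Reviewable's real event handling is certainly more involved.

```js
// Hypothetical webhook receiver that strips GitHub push events down to the
// essentials before enqueuing them in Firebase.  Field selection is invented.
var express = require('express');
var bodyParser = require('body-parser');
var Firebase = require('firebase');

var queue = new Firebase('https://example-app.firebaseio.com/queues/webhooks');
var app = express();
app.use(bodyParser.json({limit: '25mb'}));  // push events can be huge

app.post('/webhooks/github', function(req, res) {
  var event = req.body;
  var slimEvent = {
    repo: (event.repository && event.repository.full_name) || null,
    ref: event.ref || null,
    before: event.before || null,
    after: event.after || null,
    pusher: (event.pusher && event.pusher.name) || null
  };
  queue.push(slimEvent, function(error) {
    res.sendStatus(error ? 500 : 200);
  });
});

app.listen(8080);
```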

[Illustration: barricade door against dark bunny]

Problem solved, right? Not so fast. Even if your app behaves nicely, you probably allow users to write directly to your datastore from the client—that's kinda the whole point of using Firebase, after all—and there's no telling what they'll try to write, whether accidentally or maliciously. Your Firebase security rules (you've got some, right?) will limit where they can write and validate the items in your app's schema, but unless you've been extra careful they'll also allow unvalidated writes of any amount of extra data.

This is because the rules schema language is open by default: any key not mentioned is allowed and goes unchecked. You can either sprinkle "$other": {".validate": "false"} throughout all your non-wildcard items, or use a rule transpiler like Fireplan with a language whose items are closed by default unless you explicitly add .more: true to their definition. If you do that, and also ensure that users can't batch-write too many items at once to wildcard subtrees of your schema,² you should be safe from DoS attacks on this particular vector.
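As a rough illustration of closing off a node in raw security rules (your schema will differ, and the paths and fields here are invented):

```json
{
  "rules": {
    "users": {
      "$userId": {
        ".write": "auth != null && auth.uid == $userId",
        "name": {".validate": "newData.isString()"},
        "email": {".validate": "newData.isString()"},
        "$other": {".validate": "false"}
      }
    }
  }
}
```

With the "$other" rule in place, anything a client tries to write under $userId besides name and email fails validation, so nobody can smuggle megabytes of junk in alongside the legitimate fields.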

Future perfect

Dealing with all this was a pain but it's not enough to put me off Firebase. They're aware that the current situation is suboptimal and are planning to introduce monitoring improvements (soon) and infrastructure improvements (later) that will ease the pain going forward. And if you run into any issues, don't hesitate to reach out to their support team—it's probably the only part of Google where you can get a real answer to your question within hours without paying a ridiculous amount of money for a support contract.

Onwards!


  1. Dramatized here for your entertainment; the actual reply was obviously far more helpful.

  2. How exactly to do this is highly application-specific and may not always be possible. Potential approaches include narrowly limiting the format of the keys, or allowing writes only at the individual item level.