When saving entities to App Engine’s datastore at a high write rate, avoid indexing monotonically increasing values such as timestamps. Generally speaking, you don’t have to worry about this sort of thing until your application hits hundreds of queries per second. Once you’re in that ballpark, you may want to examine potential hotspots in your application that can increase datastore latency.
To explain why this is, let’s examine what happens to the underlying Bigtable of an application with a high write rate. When a Bigtable tablet, a contiguous unit of storage, experiences a high write rate, the tablet will have to “split” into more than one tablet. This “split” allows new writes to shard. Here’s a visual approximation of what happens:
There’s a moment of pain – this is one of the causes of datastore timeouts in high write applications, as discussed in Nick Johnson’s article, “Handling Datastore Errors”.
Remember that for indexed values, we must write corresponding index rows. When values are randomly or even semi-randomly distributed, like, say, user email addresses, tablet splits function well. This is because the work to write multiple values is distributed amongst several Bigtable tablets:
The problems appear when we start saving monotonically increasing values like timestamps, or insert dictionary words in alphabetical order:
The new writes aren’t evenly distributed, and whichever tablet they land on becomes a new hot tablet in need of a split.
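To make the difference concrete, here is a toy simulation, not a model of Bigtable itself. It assumes a made-up integer key space, four hypothetical "tablets" defined by three split points, and a burst of 1000 writes. The names `writes_per_tablet` and `SPLIT_POINTS` are invented for illustration.

```python
import bisect
import random
from collections import Counter

# Hypothetical split points: four "tablets" covering keys below 250,
# 250-499, 500-749, and 750 and above.
SPLIT_POINTS = [250, 500, 750]


def writes_per_tablet(keys, split_points):
    """Count how many keys land in each contiguous key range.

    Tablet i holds keys k with split_points[i-1] <= k < split_points[i];
    the last tablet holds everything at or above the final split point.
    """
    counts = Counter()
    for key in keys:
        counts[bisect.bisect_right(split_points, key)] += 1
    return [counts[i] for i in range(len(split_points) + 1)]


rng = random.Random(42)

# Randomly distributed keys (think hashed email addresses).
random_keys = [rng.randrange(1000) for _ in range(1000)]

# Timestamp-like keys: every new key is larger than all previous ones.
increasing_keys = list(range(1000, 2000))

random_load = writes_per_tablet(random_keys, SPLIT_POINTS)
increasing_load = writes_per_tablet(increasing_keys, SPLIT_POINTS)

print("random keys:    ", random_load)
print("increasing keys:", increasing_load)
```

In this sketch the random keys spread across all four ranges, while every increasing key falls into the last range, which is the hot tablet described above.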
As a developer, what can you do to avoid this situation?
- Avoid indexes unless you need to query against the values. No index = no hot tablet on increasing value
- Lower your write rate, or figure out how to better distribute values. A pure random distribution is best, but even a distribution that isn’t random will be better than a predictable, monotonically increasing value
- Prefix a shard identifier to your value. This is problematic if you plan on doing queries, as you will need to prefix and unprefix the values, then join the results in memory – but it will reduce the error rate of your writes
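The shard prefix tip can be sketched in plain Python. This is only an illustration under stated assumptions: the index is modeled as a sorted list of strings rather than a real datastore index, and `NUM_SHARDS`, the `NN|value` format, and the helper names `add_shard_prefix`, `strip_shard_prefix`, and `query_range` are all hypothetical.

```python
import bisect
import heapq
import random

NUM_SHARDS = 4  # hypothetical; pick based on your write rate


def add_shard_prefix(value, shard):
    """Prefix the value so writes scatter across NUM_SHARDS key ranges."""
    return "%02d|%s" % (shard, value)


def strip_shard_prefix(stored):
    """Recover the original value from a stored, prefixed value."""
    return stored.split("|", 1)[1]


def query_range(sorted_index, low, high):
    """Return un-prefixed values in [low, high], in sorted order.

    Runs one range scan per shard, then merges the results in memory.
    """
    per_shard = []
    for shard in range(NUM_SHARDS):
        start = bisect.bisect_left(sorted_index, add_shard_prefix(low, shard))
        end = bisect.bisect_right(sorted_index, add_shard_prefix(high, shard))
        per_shard.append(
            [strip_shard_prefix(s) for s in sorted_index[start:end]]
        )
    # Each per-shard list is already sorted, so a k-way merge is enough.
    return list(heapq.merge(*per_shard))


# Sixty timestamp-like values, each written with a random shard prefix.
rng = random.Random(0)
timestamps = ["2011-01-01T10:00:%02d" % s for s in range(60)]
index = sorted(
    add_shard_prefix(t, rng.randrange(NUM_SHARDS)) for t in timestamps
)

result = query_range(index, "2011-01-01T10:00:10", "2011-01-01T10:00:19")
print(result)
```

The cost is visible in `query_range`: one query becomes `NUM_SHARDS` scans plus an in-memory merge, which is the trade-off the last bullet warns about.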
These tips apply whether you are on the Master-Slave or High Replication datastore. And one more tip: don’t prematurely optimize for this case, since chances are you won’t run into it. You could spend that time working on features instead.
– Ikai
P.S. Yes, I drew those doodles. No, I do not have any formal art training (how could you tell?!)