Distributed Data Stores: the birth of a new layer in the stack

I learned web programming in 1995, when a SQL database for storing your data was the obvious choice, but the options were still few, expensive, and slow. Since then, the SQL database has become ubiquitous, and the options are many, including at least two very solid free/open-source solutions. But when it comes to large datasets, the paradigm for data storage is in the early stages of a radical shift towards distributed data stores. Google’s got its own (BigTable). Amazon’s got its own (Dynamo). And now Facebook, LinkedIn, StumbleUpon, and other web sites with large datasets are moving in the same architectural direction, typically by starting or contributing to some open-source project that mimics either Google or Amazon’s design.

The key aspect of this paradigm shift is, in a sense, like moving from Classical Physics to Relativity. Where SQL databases mostly provide one central view of the data, the distributed data stores are much looser: the state of the data really depends on your point of view, and what one user sees may not be the same as what another user sees at the same time, but it doesn’t really matter because “at the same time” is mostly a meaningless concept. The only things that matter are that there’s:

  1. some degree of localized consistency for a given user, i.e. if Alice accepts Bob’s friend invitation on Facebook then Alice immediately sees Bob listed as her friend, not 30 seconds and 10 clicks later, and
  2. global consistency eventually, meaning eventually all of Alice’s friends can see that Alice befriended Bob, but they don’t all have to see this the moment Alice does.
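A toy sketch may make these two properties concrete. The model below is purely illustrative (the class and method names are mine, not any real system’s API): each user has a “home” replica that absorbs their writes immediately, giving Alice her localized view, while other replicas catch up only when a background sync pass runs.

```python
class EventualStore:
    """Toy model of an eventually consistent store: a write lands on the
    writer's home replica right away (read-your-writes), and reaches the
    other replicas only when sync() runs.  Illustrative only."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def _home(self, user):
        # Deterministic user -> replica mapping (a stand-in for routing).
        return self.replicas[sum(map(ord, user)) % len(self.replicas)]

    def write(self, user, key, value):
        self._home(user)[key] = value      # visible to the writer at once

    def read(self, user, key):
        return self._home(user).get(key)   # other users may see stale data

    def sync(self):
        # Anti-entropy pass: merge every replica's state into all of them.
        merged = {}
        for replica in self.replicas:
            merged.update(replica)
        for replica in self.replicas:
            replica.clear()
            replica.update(merged)


store = EventualStore()
store.write("alice", "alice:friends", ["bob"])
store.read("alice", "alice:friends")   # ['bob'] immediately (property 1)
store.read("carol", "alice:friends")   # None until her replica syncs
store.sync()
store.read("carol", "alice:friends")   # ['bob'] eventually (property 2)
```

The window between the write and the sync is exactly the “Alice’s friends don’t all have to see it the moment Alice does” gap described above.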

This is a very difficult design principle to accept if you’ve been doing SQL databases for 14 years like I have, but it turns out to be necessary if you want to scale to terabytes of data and Facebook-level traffic. I suspect some folks will keep denying the inevitability of this shift: just get a bigger DB server, get faster drives, use fiber-channel to your Storage Area Network, partition your dataset into multiple databases, just do Service-Oriented Architecture, use Oracle 37i Replication, … Yeah. Ok. Or maybe Google and Amazon actually have a clue.

Today I was at the No SQL Event, which featured presentations from most of the major open-source distributed data stores. The tech IQ of the assembled group was extremely high: I’ve rarely seen so many clearly talented developers in one place. Also, it’s fascinating to see a number of more advanced academic results now sprinkled throughout developer presentations. “Paxos is way cool!” “Let me tell you about Merkle Trees.” “Actually 2-phase commit can hang forever if one of your nodes goes down.” “Typically, applications like to read the latest value they wrote.” There’s something fascinating going on here, with some people going back to the literature while others rediscover and reinvent. But that mix is okay, of course, because in the process folks are figuring out the practical tweaks needed to actually make the algorithms work. This is truly the birth of a new layer in the application stack, and oh by the way, it looks like everyone in this field is hiring.
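For readers who haven’t met Merkle trees: the reason they keep coming up in these talks is that they let two replicas discover where they differ without shipping every key over the network. Compare root hashes; if they match, the replicas agree on the whole dataset. Here’s a minimal sketch of the idea (my own simplified encoding, not any particular system’s wire format):

```python
import hashlib


def _h(s):
    """SHA-1 of a string, hex-encoded (choice of hash is arbitrary here)."""
    return hashlib.sha1(s.encode()).hexdigest()


def merkle_root(kv):
    """Fold a {key: value} dict into a single root hash.  Real systems
    keep the intermediate levels too, so mismatched replicas can walk
    down the tree and sync only the differing ranges."""
    level = [_h(f"{k}={v}") for k, v in sorted(kv.items())]
    if not level:
        return _h("")
    while len(level) > 1:
        if len(level) % 2:                      # pad odd-sized levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1])    # hash adjacent pairs
                 for i in range(0, len(level), 2)]
    return level[0]


a = {"alice:friends": "bob", "bob:friends": "alice"}
b = dict(a)
merkle_root(a) == merkle_root(b)   # True: replicas agree, nothing to ship
b["bob:friends"] = "alice,carol"
merkle_root(a) == merkle_root(b)   # False: walk the tree to find the diff
```

One root-hash comparison replaces a full key-by-key scan, which is why this shows up in anti-entropy repair between replicas.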

So, should you drop your SQL book and start learning Cassandra? Not necessarily. SQL isn’t going away all that soon, and many small and medium web applications will be just fine on SQL databases. Plus, these systems are not exactly plug-and-play yet. But it’s worth checking out Google App Engine, and Amazon SimpleDB, as those are up and running and easy to try. Then, take a look at HBase, Cassandra, Voldemort, and CouchDB. If you read the documentation and think “wait a minute, I can’t do a join? I don’t have transactions across my entire dataset? I have to do this crazy map-reduce instead of a group-by? What kind of crappy database is this?” then that means you’re starting to get it. There’s no such thing as a big clock in the sky, and there’s no such thing as large scalable datasets with perfect consistency. The future of large web apps requires a more relaxed outlook on data. It’s fun to see this technology in its infancy, and it’s exciting to think what will be in another 14 years.
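To see why losing group-by isn’t fatal, here’s the classic `SELECT page, COUNT(*) ... GROUP BY page` expressed as a map phase and a reduce phase. This is a single-process sketch of the programming model (my own helper names), but the point is that both phases shard cleanly across machines, which a global group-by does not:

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every record and group emitted (key, value)
    pairs by key.  In a real system this runs on many nodes and the
    grouping is a distributed shuffle."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's values to a single result, one key at a time."""
    return {key: reducer(key, values) for key, values in groups.items()}


# SELECT page, COUNT(*) FROM hits GROUP BY page, the map-reduce way:
hits = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
counts = reduce_phase(
    map_phase(hits, lambda rec: [(rec["page"], 1)]),   # emit (page, 1)
    lambda key, values: sum(values),                   # count per page
)
# counts == {"/home": 2, "/about": 1}
```

Same answer as the SQL query; the difference is that no single node ever needed to see the whole dataset.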

UPDATE: Todd Lipcon’s intro presentation was a good overview.