Let’s not let “NoSQL” go the way of “Web 2.0”

by Adam Ferrari

I get asked all the time whether Endeca should be considered a “NoSQL” database. It’s a totally reasonable question. After all, our core engine shares some attributes of a NoSQL system – it’s a persistent data store, has a non-relational data model, and has convenient APIs for developing web applications. And it works at high scale, powering some of the highest-traffic eBusiness applications on the web. But despite these shared traits and my deep respect for the innovation happening in the NoSQL movement, I don’t use that label to describe our technology.

Dave Kellogg of Mark Logic recently addressed this question in relation to their XML database and reached a different answer. His post provides a nice overview of many of the goals and anti-goals of the NoSQL movement. He also gave a well-articulated warning against mixing up technical considerations with open source considerations. But one of his fundamental points is that NoSQL is a protest movement (a Tea Party of sorts, per his title) against the limitations of mainstream DBMSs, where the protesters have a variety of motivations for departing from the mainstream. For example, NoSQL alternatives typically provide a range of non-traditional DBMS characteristics such as scalability to handle very large web applications, non-ACID data consistency semantics, open source development, and non-relational data models like JSON. And based on this last point – the segment of the “protesters” marching against the rigidity of the relational data model – Dave includes XML databases under the rapidly growing NoSQL umbrella. And it’s a totally reasonable argument if you buy into the “many reasons for protest” interpretation of NoSQL.

But while I clearly agree there are many good reasons to move beyond traditional relational database technology, I don’t think an omnibus interpretation of the NoSQL movement is quite right, so I reach a different conclusion. In my experience, “NoSQL” (which in retrospect was probably not the best choice of names) means something more specific than “any non-mainstream database.” Its original motivation stemmed from the new needs of wide area, web scale applications. This web orientation informs the form factor differences from mainstream DBMSs – things like web-friendly data models and APIs. But those aren’t what make NoSQL special. Instead, it’s the core assumptions of massive scale and widely distributed hardware infrastructure that drive the most interesting distinguishing characteristics of these systems.

For example, NoSQL databases are pioneering innovative approaches to data consistency. Wide area systems, where hardware faults and network outages are expected rather than exceptional, have to consider carefully the CAP theorem originally posed by Eric Brewer in 2000. CAP tells us that a distributed persistent data store can provide any two of (1) data consistency, (2) availability, and (3) resilience to network partitions, but not all three.

For traditional enterprise databases, data consistency (1) and high availability (2) are critical requirements, while network partitions (3) are not a big issue, so ACID transactions are the way to go. But web scale systems have to cope with real occurrences of network partitioning (3), and of course maintain high availability (2) (or suffer loss of user satisfaction, revenue, etc.). Luckily, data consistency guarantees (1) can be looser in these systems, leading to new models like “eventual consistency.” You can find a strong overview of the underlying issues and approaches in this excellent write-up by Werner Vogels of Amazon (which, incidentally, predates the coining of the term NoSQL by more than a year – the issues in play are much older than the label). The key point is that providing different trade-offs in CAP space is not just one of many motivations for NoSQL, it’s a central one.
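
To make the trade-off concrete, here is a minimal sketch of eventual consistency in Python. It is an illustration under stated assumptions, not how any particular system works: the `Replica` class, the key names, and the "last write wins" merge rule are all hypothetical, and production systems typically use richer conflict-resolution schemes than a single timestamp.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of a key-value store. All names here are illustrative."""
    name: str
    # Maps key -> (logical timestamp, value).
    data: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write locally, even if other replicas are unreachable.
        # A strictly newer timestamp replaces the stored version.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]

    def merge_from(self, other):
        # Reconciliation step: keep the newer version of every key.
        for key, (timestamp, value) in other.data.items():
            self.write(key, value, timestamp)


east = Replica("east")
west = Replica("west")

# During a partition each replica stays available and accepts writes.
east.write("cart:42", ["book"], timestamp=1)
west.write("cart:42", ["book", "lamp"], timestamp=2)

# Reads can disagree while the partition lasts.
stale = east.read("cart:42")

# Once the partition heals, the replicas exchange state and converge.
east.merge_from(west)
west.merge_from(east)
```

Both replicas answer reads and writes throughout (availability under partition), and they agree only after the merge. That gap is the consistency being given up.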

And data consistency isn’t the only requirement that’s different in large web scale systems. Massive scalability is a necessity. And not just data scale, but extreme scalability in key query performance metrics as well. For example, Raghu Ramakrishnan, Chief Scientist for cloud computing at Yahoo, spoke at the recent New England Database Summit about their PNUTS/Sherpa system. You can check out some of the scale requirements that they’re building to in his slides: tens of TB of data, consisting of wide structured records, serving tens of thousands of queries per second, all with extremely low latency (i.e., low enough latency that you can perform many such requests in the process of building up a web page to serve users who expect an instant response to every click!). With those requirements, highly scalable and high-performing approaches to sharding, replication, and caching are important new ingredients.
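
As a rough illustration of two of those ingredients, the sketch below routes each record key to a primary node by hashing and places copies on the following nodes. The node names, the replica count, and the modulo placement rule are assumptions made for the example; it does not describe PNUTS/Sherpa or any other real system.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical node names
REPLICAS = 2  # each record is stored on this many nodes


def shard_for(key, num_shards):
    # Use a stable hash so every client maps a key to the same shard.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def nodes_for(key):
    # The primary is chosen by hash; replicas are the next nodes in order.
    primary = shard_for(key, len(NODES))
    return [NODES[(primary + i) % len(NODES)] for i in range(REPLICAS)]


placement = nodes_for("user:1001")
```

Simple modulo placement reshuffles most keys whenever the node count changes, which is one reason real deployments tend to prefer schemes such as consistent hashing.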

Wide-area web applications simply have very different requirements from traditional enterprise DBMS-backed applications. And they need to make fundamentally different assumptions about the hardware environment. These combine to inform very different architectural and algorithmic solutions. Happily for the database community, the recognition that “one size fits all” is untenable for DBMS architectures has been growing steadily over the last decade, and the NoSQL movement is a nice example of that recognition in action.

So, returning to my discomfort about the “big tent” interpretation of the NoSQL concept, you might ask, “Why do I care? Why not accept a broad interpretation of that category and the buzz that goes with it?” The main reason that I don’t is very simple: focus. Massive web scale databases are invaluable. But there’s a whole world of enterprise information access problems that is also worthy of focused attention.

For example, the query types are different. NoSQL databases specialize in the query capabilities typical of the applications they target. They primarily provide keyed lookup and update of specific records, and relatively focused range filters or searches. But they don’t generally need large scale aggregations or joins. So while these systems provide massive, distributed, highly available data stores, they don’t offer the kinds of query capabilities needed for analysis and reporting over large collections of information. And with good reason – those features would be hard and expensive to provide at that scale, and they aren’t the point of the applications they target.
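
The difference in cost can be sketched with a toy sharded store. Everything here is hypothetical (the modulo routing rule, the record fields, the sample data); the point is only that a keyed lookup touches one shard while an aggregation has to visit all of them.

```python
from collections import defaultdict

NUM_SHARDS = 3
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_index(user_id):
    # Toy routing rule: integer user id modulo the shard count.
    return user_id % NUM_SHARDS


def put(user_id, record):
    shards[shard_index(user_id)][user_id] = record


def keyed_lookup(user_id):
    # Touches exactly one shard, regardless of total data size.
    return shards[shard_index(user_id)].get(user_id)


def spend_by_country():
    # An aggregation has to visit every shard and combine partial results.
    totals = defaultdict(int)
    shards_visited = 0
    for shard in shards:
        shards_visited += 1
        for record in shard.values():
            totals[record["country"]] += record["spend"]
    return dict(totals), shards_visited


sample = [(1, "US", 30), (2, "US", 5), (3, "FR", 25), (4, "FR", 10)]
for user_id, country, spend in sample:
    put(user_id, {"country": country, "spend": spend})
```

With more shards spread over a wide area, the lookup cost stays flat while the aggregation cost grows with the size of the cluster.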

A similar dichotomy exists in the relational world, where some systems (most obviously the big commercial RDBMSs) are optimized for data storage and management (i.e., the stuff that most enterprise applications such as ERP rely upon), and these are outperformed by an order of magnitude or more by systems like Vertica, which are optimized for data analysis (i.e., the stuff that BI applications rely upon).

As part of a team focused on enterprise-oriented information access problems, which are a different beast from wide area data stores, I don’t apply the “NoSQL” label to what we’re doing. At our core, we’re targeting different problem spaces. And I have a huge amount of respect for what the NoSQL movement is doing. For example, the work being done on consistency models like the Vogels paper I mentioned above is big league computer science that is making large contributions to the ways that technology can play bigger and more helpful roles in our lives. I’d just hate to see the “NoSQL” label go the way of “Web 2.0,” a moniker that rapidly came to mean everything and so nothing at all.

Posted on March 10, 2010 at 5:25 pm · Permalink
In: databases

4 Responses

  1. Written by Dave Kellogg
    on March 10, 2010 at 8:38 pm
    Permalink

    Nice post, Adam.

    I agree that the roots of NoSQL were in large scale and distributed (commodity) hardware infrastructure. But, from the database mafia perspective, “database people” (of which I consider myself one) would most certainly want to do those apps and not have databases excluded for scaling and/or hardware reasons. Scaling big and exploiting large hardware clusters are supposed to be things that databases do.

    Put differently, from the database community perspective, it is a “failure” that people moved to non-database (or un-database) systems for these applications. That’s why Stonebraker blasted them early on and got heavily blasted back in return.

    In the end, I think traditional databases have two key flaws which are driving folks in the NoSQL direction: [1] they don’t handle unstructured information well, and [2] they set too high a bar on transaction consistency for some applications. As you / Brewer point out, for many applications you want to make trade-offs that databases prohibit because they take a very purist approach.

    I agree on the analysis vs. simple lookups bent but would note that folks like Aster, who embed MapReduce in a data warehouse DBMS, enable you to do both.

    I don’t view Endeca as a NoSQL system either, though I do suspect NoSQL will go pretty much exactly the same way as Web 2.0, i.e., get over-used and hyped into meaning pretty much nothing at all.

    Best,
    Dave

  2. Written by aferrari
    on March 11, 2010 at 7:55 am
    Permalink

    Dave,
    Thanks for the comment. As always you make solid points. But I don’t think NoSQL has to go the way of Web 2.0, and I continue to view the positioning of tech like XML databases under the NoSQL label as problematic. Lumping the ideas together only serves to confuse the issue for the very managers for whom you’re trying to help clarify reasoning around using non-mainstream DBMSs.

    The protest model of NoSQL doesn’t ring true to me. It’s absolutely true, as you point out, that many “database people” view NoSQL as problematic. But then many serious database people are also participating in moving NoSQL tech forward. A quick anecdote is illustrative: at a recent MongoDB talk, Stonebraker stood up to scold the speaker for inventing a new data model without a principled motivation. But Mike’s former student Sam Madden immediately stood up to point out that the Mongo guys hadn’t exactly invented JSON, and that it was a natural choice since so many web developers are comfortable with it. It doesn’t seem like a protest argument. More like a technical discussion with smart people on both sides.

    And as to the passionate responses you’re getting on your blog, I don’t think that’s frustration with the DBMS oligopoly manifesting as “religion.” It seems more to me like the passion of a graduate student defending a dissertation, where you’ve got an idea that’s new and real but not everybody gets it. When you see someone not getting it, you just want to share (and yes, validate) the technical insight.

    -Adam

  3. [...] Development by François Schiettecatte on March 11, 2010 Great post by Adam Ferrari titled Let’s not let “NoSQL” go the way of “Web 2.0” on the need to focus the definition of “NoSQL” lest it turns into the term “Web [...]

  4. Written by Enlaces rápidos (15-03-2010) | Sentido Web
    on March 15, 2010 at 8:33 am
    Permalink

    [...] Let’s not let “NoSQL” go the way of “Web 2.0” [...]
