MapReduce just semi-good for semi-structured data

by Adam Ferrari

Sybase recently announced that it’s become the latest analytical database vendor to hop onto the MapReduce (MR) bandwagon.  This trend has been ramping throughout the past year+, ignited by early innovators like Aster and Greenplum, followed fairly quickly by others such as Netezza and (somewhat surprisingly) Vertica.

One local outcome for yours truly has been lots more questions directed my way about the applicability of MR for semi-structured data analysis. After all, if this stuff is a de-facto “must have” in the relational ADBMS market, mightn’t it also be in the semi-structured analytics space?

My answer is that, on the one hand, much of what’s been said about MapReduce as it relates to the high-performance RDBMS domain (both positive and negative) holds perfectly well in the world of semi-structured data. But MR and the DBMS are different tools, with different sweet spots. And the challenges of unlocking value in complex, semi-structured data up the ante on understanding when to apply each.

Let’s recap the germane background about MR. At the highest level, MR is a generic algorithm for parallel processing. It’s generic in the sense that MR, or more precisely an implementation such as Hadoop, provides the skeleton of the algorithm, which can then be instantiated to solve real problems by supplying implementations of the “map” and “reduce” functions. (For you programmers, this is analogous to something like STL sort, which provides a good general-purpose sorting algorithm that you can specialize for your favorite data type and notion of order by providing the comparator.)
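To make the "skeleton plus two functions" idea concrete, here is a minimal single-process sketch in Python. It is not Hadoop's API, and it does no actual parallelism; the names map_reduce, word_count_map, and word_count_reduce are made up for illustration. The point is only the division of labor: the skeleton owns the grouping, and the caller supplies the map and reduce logic.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Toy, single-process stand-in for the MR skeleton (hypothetical helper).

    A real implementation would spread the map and reduce calls across
    many machines; the structure of the computation is the same.
    """
    # Map phase: each record yields zero or more (key, value) pairs,
    # which the skeleton groups by key (the "shuffle").
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key's values are folded into one result.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# The caller specializes the skeleton, much like passing a comparator to sort.
def word_count_map(line):
    for word in line.lower().split():
        yield word, 1

def word_count_reduce(word, counts):
    return sum(counts)

lines = ["big data big plans", "data beats plans"]
counts = map_reduce(lines, word_count_map, word_count_reduce)
print(counts)
```

Swapping in a different pair of functions changes the analysis without touching the skeleton, which is exactly the sense in which MR is generic.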

Good general purpose tools for parallel computing like MapReduce are extremely valuable. After all, parallel programming isn’t a walk in the park, but parallelism is the natural way to tackle problems that involve processing large amounts of data. So, so far so good – MR and big data seem like a match made in heaven, and in many interesting cases there’s no question that they are.

But now let’s come back to our friend the DBMS. One of the truly, deeply great things about DBMSs is that they provide declarative query languages, like SQL and XQuery, which give you the ability to ask for what you want without having to describe how to compute it. This allows the database to automatically select the best algorithms (e.g., HashJoin versus MergeJoin), apply these in an advantageous order, accelerate query evaluation by automatically using indexes when they are available and appropriate, etc. And more to the point, declarative querying creates the opportunity for DBMSs to automatically parallelize query evaluation behind the scenes. Many DBMSs are great at parallelizing query evaluation, and again, since parallel programming is so hard, this benefit of declarative query languages is huge.
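A small sketch of what "declarative" buys you, using Python's built-in sqlite3 module as a stand-in for a real analytic DBMS. The tables, data, and index name are invented for the example. The query says only what is wanted; the hand-written version below it has to pick a join strategy itself.

```python
import sqlite3
from collections import defaultdict

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
    INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
    INSERT INTO orders VALUES (1, 10), (2, 25), (3, 5), (1, 40);
""")

# Declarative: state the result; the engine chooses the join
# algorithm, the evaluation order, and whether to use an index.
QUERY = """
    SELECT c.region, SUM(o.amount)
    FROM orders AS o JOIN customers AS c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
"""
declarative = conn.execute(QUERY).fetchall()

# Imperative: the same analysis with the "how" spelled out.
# Here we have committed to a hash join by building a lookup table.
region_of = dict(conn.execute("SELECT id, region FROM customers").fetchall())
totals = defaultdict(int)
for customer_id, amount in conn.execute("SELECT customer_id, amount FROM orders"):
    totals[region_of[customer_id]] += amount
by_hand = sorted(totals.items())

# Adding an index changes how the engine may run the query,
# but the query text itself does not change at all.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
indexed = conn.execute(QUERY).fetchall()

print(declarative, by_hand, indexed)
```

The hand-written loop gives the same answer, but any change in data volume, layout, or available indexes means revisiting the code, whereas the query stays put.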

But this in no way diminishes the value of MR – MR kicks in as a power-tool when you can’t do the job with a database alone. That might be because you can’t readily express your data analysis in the DBMS’s query language. Or it might be because you don’t want to take the time to load your data into a DBMS (e.g., web search indexing, some types of log analysis).

MR and databases are simply different tools, useful for different problems. And there’s a class of problems where both are needed together, which is where the raft of MR-related product announcements in the analytic database space comes from. Namely, the data of interest is in the DBMS, but the analysis is best expressed using MR.

So what about the semi-structured space? Here I believe the combination of MR and DBMS is less compelling.

First of all, on a simply pragmatic level, semi-structured databases are a newer class of technology, and consequently, they face fewer limitations than the RDBMS for semi-structured data analysis. Semi-structured databases were designed to provide more expressive query capabilities than SQL. For example, XQuery provides the power of a general purpose programming language while retaining a strongly declarative nature. More expressive query capabilities mean you can get more of your analysis done in the DBMS, retaining the benefits of declarativeness like query planning, optimization, and parallelization.

But there’s a deeper and much less obvious reason for wanting to “stay declarative” and not resort to custom coding in MR. The reason is query refinement, a technical approach that addresses one of the key requirements for analyzing semi-structured data: the use case of allowing people to make sense of complex, unfamiliar data sets.

What’s so important about query refinement? In real-world situations, when the data gets complex, query writing becomes hopelessly complex. Even a powerful language like XQuery isn’t the solution; the person composing the query has to understand too much about the structure of the data. What’s needed is a tool to help users build interesting queries incrementally, keeping them on the path and showing them the interesting next steps within the context of their current data working set: a tool that provides a rich palette of contextual query refinement opportunities.

And the title of our blog notwithstanding, I’m not equating query refinement with faceted search here. It’s a superset of faceted search. Query refinement interfaces can suggest all kinds of interesting ways to build a query such as joining in related data, aggregation, sorts, and much more.  Of course, query refinement can’t analyze and expand arbitrary programs (you run into a little hitch called the halting problem). But it’s an amazingly powerful way to get users who are not experts in the data at hand to a big space of interesting analyses within the declarative confines of the DBMS query language. Escape out into the world of custom coding, and you forgo this power.
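For a feel of what a refinement interface computes, here is a deliberately tiny Python sketch covering only the faceted-search subset of the idea. It is an illustration under my own assumptions, not a description of any product; suggest_refinements, refine, and the sample records are all hypothetical. Given the current working set, it proposes the field values that would narrow it further.

```python
from collections import Counter

def suggest_refinements(working_set, max_values=5):
    """Count the values of every field present in the current working set.

    Records are plain dicts and need not share the same fields,
    which is the semi-structured part.
    """
    counts_by_field = {}
    for record in working_set:
        for field, value in record.items():
            counts_by_field.setdefault(field, Counter())[value] += 1
    # Keep only fields with more than one distinct value; a field with
    # a single value offers no useful choice to the user.
    return {
        field: counts.most_common(max_values)
        for field, counts in counts_by_field.items()
        if len(counts) > 1
    }

def refine(working_set, field, value):
    """Apply one suggested refinement, producing the next working set."""
    return [r for r in working_set if r.get(field) == value]

docs = [
    {"type": "review", "product": "camera", "rating": "5"},
    {"type": "review", "product": "camera", "rating": "2"},
    {"type": "ticket", "product": "camera", "status": "open"},
    {"type": "ticket", "product": "laptop", "status": "closed"},
]

first = suggest_refinements(docs)
tickets = refine(docs, "type", "ticket")
second = suggest_refinements(tickets)
print(first)
print(second)
```

Notice that the suggestions change with the working set: once the user narrows to tickets, "rating" drops out and "status" remains. The user never needs to know up front which fields exist, and each step stays an ordinary filter that could be expressed in the query language.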

So returning to the question at hand, is MR applicable for analyzing semi-structured data? Of course the right answer is going to be “yes” for some applications, for the very same types of reasons that crop up in the structured data space. My hope here is to challenge some of the muddier thinking about MapReduce and its applicability as a DBMS replacement for analytical use cases. The power of the DBMS abstraction is not just about how quickly you can get an analysis done, but about how easily and directly users of varying skill levels can work with information. Pick the right tool for the job, absolutely. Just be sure to take into account the relative capabilities of the respective tools and the true nature of the problem at hand. There’s a ton of great analysis of semi-structured data, like query refinement interfaces, that works better from within a database (albeit not an RDBMS; you need a newer-generation semi-structured database) than outside the database with MR or other custom coding.

Posted on January 18, 2010 at 4:27 pm
In: databases

2 Responses


  1. [...] This post was mentioned on Twitter by drewvolpe, Adam Ferrari. Adam Ferrari said: MapReduce for semi-structured data analysis? http://bit.ly/5CKj9k #searchfacets [...]

  2. Written by Paul
    on March 16, 2010 at 12:42 pm

    Have you looked into DryadLINQ? I am currently doing a report on it for my graduate class. It seems to be more general than MapReduce and uses .NET’s LINQ (Language Integrated Query) to give you some declarative types of operations in C# or VB that mimic SQL’s select and where clauses. I believe LINQ can also integrate with popular DB systems. I have only researched it; I have no hands-on experience.
