Hadoop + Hive + Endeca, Spotted in the Wild

by Pete Bell

In his post “MapReduce just semi-good for semi-structured data,” Adam Ferrari answered one of his FAQs about the relationship between Endeca and MapReduce, the popular big data cruncher. Now here’s one example of them complementing each other.

The question Adam answered was, if MapReduce is so powerful for processing big data, then what role does Endeca play?

By way of background, MapReduce is “a software framework for distributed processing of large data sets on compute clusters,” which is itself a sub-project of Hadoop, “open-source software for reliable, scalable, distributed computing.” They take parallel processing that was once rarefied because it required esoteric dev skills and expensive hardware and make it accessible to people with mortal IT skills and cheap hardware.

Adam answered that the details matter. What kind of data are you crunching, and how do you want to query it? For example, if you understand the structure of the data and know how you want to query it, MapReduce is perfect. On the other hand, if you have heterogeneous, semi-structured data, then we know empirically that you likely won’t know in advance how you want to query it, so instead you’ll need to explore and refine it. Endeca fits that use case.
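
To make the first case concrete, here’s a toy sketch of a MapReduce-style job for a question you can state up front: how many searches went to each destination? It runs in a single Python process rather than on a Hadoop cluster, and the record layout (user, destination) and the sample data are my own assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical sample records; the "user,destination" layout is assumed.
RECORDS = [
    "alice,BOS",
    "bob,SFO",
    "carol,BOS",
    "dave,JFK",
    "erin,SFO",
    "frank,BOS",
]


def map_phase(line):
    """Emit a (destination, 1) pair for one input record."""
    _user, destination = line.split(",")
    yield destination, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one key."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence over the records."""
    pairs = (pair for line in records for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(run_job(RECORDS))
```

The point of the sketch is that the key (destination) and the aggregation (a count) are both fixed before the job runs. That’s exactly the situation where you already know the structure and the question.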

Another complement to Hadoop is Hive, “a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files.”

Taken together, Hadoop, Hive, and Endeca can give you an Agile BI solution for big data.

And in fact, Vinay Mohta, a product manager at Kayak, the vertical travel site, has been blogging about this very use case. Vinay is a perfect early adopter because he’s an Endeca veteran, having served as both a core software architect and a product manager. From his blog:

I’ve been using Hadoop and Hive for the last six months and have been pretty impressed with how well it works.  To state the obvious, if you can correctly formulate your query, nothing beats this approach.  It’s been very useful for doing cohort analysis and large scale lifetime value computations on a relatively high traffic site.  There are of course limits to what you want to keep in Hadoop / Hive; however, the convenience and the growing feature set are reducing that limit more and more.

Hive is not a good store as a backend for a BI product, since it offers no caching at all.  However, a workflow where you crunch data in Hadoop/Hive and then export to a MySQL table (or an Endeca instance) for use in a BI tool works very well.
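
As a rough illustration of that export step (my sketch, not Vinay’s code), here’s what loading a small, pre-aggregated Hive result into a relational table might look like. I’m assuming the Hive output has been written as tab-separated text, and I’m using Python’s built-in sqlite3 as a stand-in for MySQL so the example runs anywhere. The table name, columns, and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical Hive output, one row per signup cohort:
# cohort <TAB> users <TAB> total revenue
HIVE_EXPORT = (
    "2010-01\t1200\t54000.50\n"
    "2010-02\t950\t41250.00\n"
    "2010-03\t1430\t60310.25\n"
)


def load_summary(conn, tsv_text):
    """Load the aggregated rows into a small table a BI tool can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cohort_ltv ("
        "cohort TEXT PRIMARY KEY, users INTEGER, revenue REAL)"
    )
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    rows = [(cohort, int(users), float(revenue)) for cohort, users, revenue in reader]
    # INSERT OR REPLACE keeps a re-run of the export from duplicating rows.
    conn.executemany("INSERT OR REPLACE INTO cohort_ltv VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def revenue_per_user(conn):
    """The kind of fast, repeated query the BI layer would issue."""
    cur = conn.execute(
        "SELECT cohort, revenue / users FROM cohort_ltv ORDER BY cohort"
    )
    return cur.fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # sqlite3 stands in for MySQL here
    print(load_summary(conn, HIVE_EXPORT), "rows loaded")
    for cohort, ltv in revenue_per_user(conn):
        print(cohort, round(ltv, 2))
```

The heavy lifting stays in Hadoop/Hive. Only the small summary crosses over, so the interactive queries hit a store that’s built to answer them quickly.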

Vinay’s not the only one. We’ve heard from quite a few customers who have Hadoop and Endeca together in their workflow. These are fun to track because once people are in an Agile workflow, they inevitably invent new use cases.

I’d love to hear how you’re using these tools together. I’d also like to know your motivation. Is it because it’s a quick path to Agile BI, or is there a qualitative difference between these new tools and a traditional enterprise data warehouse?

Posted on June 25, 2010 at 3:09 pm
In: BI, databases
