How similar are faceted search and OLAP? See CIO Mag: 20 to Watch in 2010
by Adam Ferrari
Endeca received a nice spot in CIO magazine’s new list, “Twenty companies to watch in 2010.” Compared with much of the coverage we get in the press, which tends to focus on one of our core market solutions such as Retail or Manufacturing, I was pleased with the breadth that CIO was able to touch on in their blurb about us:
Endeca (www.endeca.com) is fast becoming the Google of enterprise search for e-commerce sites. The company is much beloved by customers as a fast and friendly way to release business information, much of it turning up surprising discoveries. It offers perhaps the friendliest way to do ad hoc BI and customer analytics.
In particular, as a technologist I’m happy to see e-commerce and BI mentioned in such close proximity. From the outside, especially for people who aren’t familiar with our technology, this combination of problem spaces might seem surprising. But if you dig in at a technical level, this combination of problems not only makes sense, it’s where information access has to go.
To back that claim up, I’d point to the central concept that our MDEX engine was built upon: the ability to provide interactive-speed aggregation and analytics of semi-structured data. For example, we use a decomposed storage model (similar to analytical RDBMSs) to allow high-speed data scans, but without imposing the rigidity of a relational schema. Similarly, we overlay a dimensional data model on our data store to power packaged query analyses, like extremely scalable faceted search, dimension search, and OLAP-style analytics, but again without the rigid data modeling constraints of a cube. What does this mean? If your data is complex and messy (in other words, hard to model efficiently in a relational schema), but you want to do ad hoc guided analysis on it with interactive performance, this is the right platform.
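To make the storage idea concrete, here is a minimal sketch of a decomposed (column-per-attribute) layout over records that do not share a fixed schema. It is an illustration only, not a description of how the MDEX engine is implemented; the record fields and the helper names `decompose` and `sum_by` are hypothetical.

```python
from collections import defaultdict

# Hypothetical semi-structured records: not every record has every attribute.
records = [
    {"id": 1, "category": "camera", "price": 299.0, "megapixels": 12},
    {"id": 2, "category": "camera", "price": 549.0},
    {"id": 3, "category": "laptop", "price": 899.0, "screen_in": 15},
]

def decompose(rows):
    """Store each attribute as its own sparse column: attr -> {record id -> value}."""
    columns = defaultdict(dict)
    for row in rows:
        for attr, value in row.items():
            if attr != "id":
                columns[attr][row["id"]] = value
    return columns

def sum_by(columns, dimension, measure):
    """Aggregate one measure along one dimension, reading only those two columns."""
    totals = defaultdict(float)
    for rec_id, dim_value in columns[dimension].items():
        if rec_id in columns[measure]:
            totals[dim_value] += columns[measure][rec_id]
    return dict(totals)

columns = decompose(records)
price_by_category = sum_by(columns, "category", "price")
print(price_by_category)
```

A new attribute simply becomes a new column when a record first carries it, so nothing has to be declared up front, and an aggregation scans only the columns it names.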
The tie-in from this to the BI mention in the CIO article is pretty obvious. When I look at applications we’re powering like fine-grained customer segmentation or warranty analytics – applications where traditional BI fell down but we were able to save the day – the need for Business Intelligence without the traditional data modeling boundaries imposed by relational databases is pretty clear. But what does that have to do with the needs of a market like e-commerce?
There are common problems faced by two of the most important constituencies for both those domains: the end users, and the IT staff charged with making the application go.
Like it or not, online shoppers have a BI problem on their hands. They need to make hard decisions, and they want to use all of the data available to them – product information, social content, editorial content, and so on – to figure things out. Great search and relevance are very important – if I’m looking for something specific, and you can’t correct my atrocious spelling (no kidding there!) and pull the best (i.e., most relevant) bets to the top of the page, I’m not going to be a happy shopper. But what if (again no kidding) I don’t know exactly what I want? How am I going to figure it out?
Faceted search is an important part of the answer. I need to see the various axes that characterize the space of available options, and explore the implications of choices (for example, picking that large screen size really changes my available price options – maybe I’d be OK with something a little smaller!). And by doing that, I not only give the system more clues about what is relevant to me, but I also figure out for myself what I care about. The search experience teaches me about what characteristics of the data I find important, building up a rubric that no relevance algorithm could possibly have guessed at the start.
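The screen size versus price interplay can be sketched in a few lines. This is a toy under stated assumptions: the product list, the facet names, and the helpers `refine` and `facet_counts` are all made up for illustration, and a real engine would compute these counts from an index rather than by scanning.

```python
from collections import Counter

# Hypothetical catalog; facet names and values are illustrative only.
products = [
    {"name": "TV A", "screen": "large", "price_band": "$$$"},
    {"name": "TV B", "screen": "large", "price_band": "$$$"},
    {"name": "TV C", "screen": "medium", "price_band": "$$"},
    {"name": "TV D", "screen": "medium", "price_band": "$"},
]

def refine(items, selections):
    """Keep only the items that match every selected facet value."""
    return [p for p in items if all(p.get(f) == v for f, v in selections.items())]

def facet_counts(items, facets):
    """Count how many items carry each value of each requested facet."""
    return {f: Counter(p[f] for p in items if f in p) for f in facets}

all_counts = facet_counts(products, ["screen", "price_band"])
large_only = refine(products, {"screen": "large"})
large_counts = facet_counts(large_only, ["price_band"])

print(dict(all_counts["price_band"]))
print(dict(large_counts["price_band"]))
```

Recomputing the counts after each selection is what shows the shopper the consequence of a choice: here, picking the large screen leaves only the top price band.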
This characteristic of faceted search makes it a vital tool for both online search and for enterprise analytics. But the linkage between these problems goes much deeper. For example, although it’s clear why I want to be able to rapidly add up total spend aggregated along various dimensions in a supply chain analytics application, it may not seem like I need that kind of number crunching in an online search application. But increasingly that need is showing up. One clear driver, which may come as a bit of a surprise, is relevance.
As a simple example, consider a travel site where I can search for lodging. A good vertical search site in that domain would know a lot about each hotel – categorical information like various amenities, but also quantitative scores that rate the hotel along various axes: price, proximity to good restaurants, proximity to the airport, décor, business traveler friendliness, kid friendliness, and so on. And by observing a customer’s behavior over time, the site could begin to estimate that customer’s affinity for various hotel characteristics. Using this data, I could compute a weighted sum of factors for each hotel matching a user’s search, plugging in their personalized coefficients, and blending the result with other relevance factors. Now instead of a one-size-fits-all best bet, I’m computing a personalized ranking.
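Here is a rough sketch of that weighted sum. The hotel names, the scores, the two coefficient profiles, and the `blend` parameter are all invented for illustration; how a real site would estimate the coefficients or combine them with text relevance is left open.

```python
# Hypothetical per-hotel scores on a 0..1 scale (higher is better for the traveler).
hotels = {
    "Airport Inn": {"price": 0.8, "airport": 0.9, "kids": 0.2, "restaurants": 0.3},
    "Family Resort": {"price": 0.4, "airport": 0.2, "kids": 0.9, "restaurants": 0.6},
}

# Hypothetical personalized coefficients, e.g. estimated from past behavior.
business = {"price": 0.2, "airport": 0.6, "kids": 0.0, "restaurants": 0.2}
family = {"price": 0.3, "airport": 0.1, "kids": 0.5, "restaurants": 0.1}

def personalized_score(scores, weights, text_relevance=0.0, blend=0.5):
    """Weighted sum of hotel factors, blended with another relevance signal."""
    affinity = sum(weights.get(k, 0.0) * v for k, v in scores.items())
    return blend * affinity + (1 - blend) * text_relevance

def rank(candidates, weights):
    """Order hotel names from best to worst for the given coefficients."""
    return sorted(
        candidates,
        key=lambda name: personalized_score(candidates[name], weights),
        reverse=True,
    )

print(rank(hotels, business))
print(rank(hotels, family))
```

Swapping the coefficient profile is the whole trick: the same two hotels come back in the opposite order for the business profile and the family profile.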
In the spirit of HCIR, I could even give the user visibility and control over this, e.g., sliders to adjust my coefficients, with the initial settings guessed from my past behavior, so that when I switch over from my more typical business traveler mode to book a trip with my family, I could let the system know.
To deliver the above example, I’d need a high-performance analytics engine tightly integrated with my search engine (tightly integrated as in a single, unified data store and query executor, not tightly integrated as in an RDBMS and a search engine glued together, which would never scale). I want the modeling flexibility of semi-structured data, but the analytic efficiency of a column store.
That’s where the IT constituency I mentioned above comes in. Today’s BI practitioners are being asked to deliver a more consumer-friendly analytics experience with integrated search. Meanwhile, developers of online search applications are being asked to do more with structured content to deliver better relevance and more ways for users to interact with data to make better decisions.
Much like the old Reese’s commercial that depicts the apocryphal origins of peanut butter cups, it might seem strange that the same engine would mix faceted search chocolate with BI peanut butter. But today’s search and BI applications that cope with complex data and provide ad hoc access increasingly want the same peanut butter cup. And these problems provide great and growing opportunities for cross-pollination. BI wants to become more consumerized and user friendly to gain adoption. And search applications want analytics and visualizations. The insight that these two problems are in fact two sides of the same coin informs a unified architecture that tastes great.
on February 2, 2010 at 3:25 am
Here’s a WSDM paper by several researchers from IBM (Ben-Yitzhak et al.) about extending search facets to an OLAP framework (PDF):
http://nadav.harel.org.il/papers/p33-ben-yitzhak.pdf
on February 4, 2010 at 5:57 pm
Thanks for the highly relevant pointer. This comes at the question from the search-oriented point of view, looking at how to implement OLAP-style functionality on top of a search index. I find the complementary perspective is also important to consider, namely building search applications on a more structured data store. That angle leads to some different decisions. For example, this approach comes at the problem of modeling distinct data entities by organizing them into trees. But that’s where some of the early database models like IMS ran into data modeling problems, which then gave rise to the more flexible relational model. There are lots of good reasons to hold onto the lessons learned in the structured data world while approaching the problem of blending analytical capabilities with the flexibility of search. -Adam
on March 1, 2010 at 3:00 pm
[...] The particular aspect of these discussions that I found so striking: the people I talked to came from an amazing breadth of organizations, yet they faced the same fundamental business challenge and underlying technology challenge. Each owned the goal of delivering information to his organization to increase its competitive advantage. And this meant that each faced the technical challenge of enabling end users to understand information too complex and messy to fit into relational BI schemas, via analytics and visualizations that are outside the scope of what search engines provide. They needed a hybrid of search and BI, a convergence that I’ve written about before. [...]
on March 9, 2010 at 11:18 am
[...] far, this should be familiar to anyone tracking this space. But here’s the interesting part. Who initiates the purchase of an intelligent workspace? Sue [...]
on May 12, 2010 at 8:46 pm
That IBM paper is pretty interesting, and it does explore a frustrating problem: counting items correctly when you want to show only *one* canonical product to the user regardless of query. So “red shoes” might show red shoes in a per-SKU database, but “shoes” shows a default color, exactly once for all variations.
The user will see the “cartesian product” of color variations unless this is addressed somehow.
So “shoes” is the potentially big problem. And even if they do type “red shoes,” that doesn’t eliminate the “cartesian products” in another dimension. Ack.
You have 3 choices here:
1. Ignore the problem and let the user filter for red after the fact. That may work a lot of the time. Pretty well, but it depends on the number of variants. 5 laptops? Maybe. 24 shoe sizes x 3 colors = 72 shoes per shoe on my screen in perhaps no particular order? Ouch.
-or-
2. Use rules to extract the facets beforehand, and hope you get it all correctly. This probably won’t work much of the time. As an aside, it would be nice if selecting “Red” as a dimension implicitly did s/red/null/ on the search query, if the word is there. Maybe it only bothers me to see the word in both the filters and the text query.
3. Do something really slow (what they’re citing as a flaw in the RDBMS model), and remove duplicates in data. Then count it. Ouch. You’d assume computers are always good at *both* finding duplicates and counting, but when you’re dealing with n! sized lists, they don’t shine as much.
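To put numbers on the duplication, here is a small sketch using the 24 sizes x 3 colors example from choice 1. The catalog, the parent ids, and the helper names are hypothetical. It contrasts counting SKUs per facet value with counting distinct parent products, which is the de-duplicated count from choice 3.

```python
from collections import Counter, defaultdict
from itertools import product

sizes = [f"{s / 2:g}" for s in range(10, 34)]  # 24 sizes: 5, 5.5, ..., 16.5
colors = ["red", "black", "brown"]

# Hypothetical per-SKU catalog: shoe-1 comes in every size and color,
# shoe-2 comes in every size but only in black.
skus = [{"parent": "shoe-1", "size": s, "color": c} for s, c in product(sizes, colors)]
skus += [{"parent": "shoe-2", "size": s, "color": "black"} for s in sizes]

def sku_counts(rows, facet):
    """Naive facet count: one hit per SKU, so every variant is counted."""
    return Counter(r[facet] for r in rows)

def product_counts(rows, facet):
    """De-duplicated facet count: distinct parent products per facet value."""
    parents = defaultdict(set)
    for r in rows:
        parents[r[facet]].add(r["parent"])
    return {value: len(p) for value, p in parents.items()}

print(len(skus))
print(dict(sku_counts(skus, "color")))
print(product_counts(skus, "color"))
```

The de-duplicated count has to remember which parents it has already seen for every facet value, and that bookkeeping is where the cost comes from once the result set is large.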
Of course this is not *always* desirable. And not everyone does it. I doubt most electronics sites want it, as most people see a red laptop and a black laptop as a different SKU. Best Buy doesn’t do it, B&H doesn’t do it, and J&R doesn’t do it. I think this decision is correct.
The latter 2 appear to use Endeca. They’re fabulous, too.
I can find anything I want on B&H.
Clothing is way more iffy. Enter Lands’ End. They’re a great example of this situation in action.
http://www.landsend.com/ix/index.html?store=le&action=newSearch&search=red+sweaters
(a full-text search query for “red sweaters”, which is brilliant, although they do not select the color automatically, which is an odd omission.)
This is doable within a category (i.e. not a search), because it’s a finite number of products, and the indices can be built to take this into account. i.e. instead of representing each product-variant and its attributes in an individual cartesian-product, build the index with its parent product and the aggregation of all attributes. It’s not clear that you can get the right product from an index like that, but you also know which facets the user has applied.
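A sketch of what I mean by indexing the parent with the aggregation of all attributes. The SKUs and helper names are invented. It also shows why I’m unsure you can get the right product out of such an index: the aggregate forgets which combinations actually exist.

```python
from collections import defaultdict

# Hypothetical product variants (one row per SKU).
variants = [
    {"sku": "SW-1-RED-M", "parent": "SW-1", "color": "red", "size": "M"},
    {"sku": "SW-1-BLU-L", "parent": "SW-1", "color": "blue", "size": "L"},
    {"sku": "SW-2-RED-L", "parent": "SW-2", "color": "red", "size": "L"},
]

def build_parent_index(rows, facets):
    """One entry per parent product, holding the union of its variants' values."""
    index = defaultdict(lambda: {f: set() for f in facets})
    for row in rows:
        for f in facets:
            index[row["parent"]][f].add(row[f])
    return index

def match_parents(index, selections):
    """Parents whose aggregated values contain every selected facet value."""
    return sorted(
        parent
        for parent, attrs in index.items()
        if all(value in attrs[f] for f, value in selections.items())
    )

def pick_sku(rows, parent, selections):
    """Resolve a parent back to a concrete SKU using the applied facets."""
    for row in rows:
        if row["parent"] == parent and all(row[f] == v for f, v in selections.items()):
            return row["sku"]
    return None  # the aggregate matched, but no single variant does

index = build_parent_index(variants, ["color", "size"])
wanted = {"color": "red", "size": "L"}
print(match_parents(index, wanted))
print(pick_sku(variants, "SW-1", wanted))
print(pick_sku(variants, "SW-2", wanted))
```

Each parent shows up exactly once, which is the point, but SW-1 matches “red” plus “L” even though it has no red L variant. You need the applied facets and a second lookup to land on a real SKU.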
On the other hand, you probably *do* want a URL per product variation for the big G, and to benefit users as well as the big G, you’d want to generate the URL for the proper SKU. That way users get to the best page, and bots get all the SKUs. That’s tricky, but not impossible. Actually, it’s a tree. But unless you have tons of variations this can be handled quickly. A tree would be 100% efficient, though.
It’s much trickier for text queries.
I believe this is (part of?) what the paper is attempting to address. And it’s a clever idea. I’m not a PhD in CS, but I see some potentially dubious things in that paper as well:
1. Is the large index size really a big deal? The attributes are integers unless you’re actually storing “forest green” … “forest green” … “forest green”. OK. Don’t do that. Most column-oriented/decomposed engines will encode the values as integers for you quietly. On top of that, they typically have a way to skip over large swathes of data by storing the MAX and MIN of the column at intervals.
Many RDBMSes have compression methods that shrink tables and indices pretty impressively for repetitive structured data. They listed it as shortcoming #1, but a tree is a lot of complexity just to eliminate some repetition.
And “The study of the associated optimization problem that receives as input a set of documents with associated facet values and computes their most compact tree representation is beyond the scope of this paper.”
Right
2. “Table 3” looks totally wrong to me unless they’re using an index I don’t understand. They said nothing about bitfield indexes, though maybe that’s what they meant.
They’re correct on the issue that an FTS DB will repeat itself, and that’s probably a bigger problem. This is where that tree idea starts to make more sense to me.
The tree structure also shines if you’re trying to get a grouped count of all matching rows in the midst of a text query. Otherwise, like they say, you’ll be left with duplicates, and that can get really, really slow.
One could maybe “hack it” without a tree, and simply build the text index with 1 row for all product variants without any structure. However, then you’d always be selecting an arbitrary product regardless of relevance. Only some dimensions matter, but it’s still a problem.
In this case … you want that tree?
I’m not sure that column-oriented databases solve this problem.
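For reference, the two column-store tricks I mentioned under point 1 (storing integers instead of repeated strings, and keeping the MIN and MAX of the column at intervals) look roughly like this. It’s a toy sketch with made-up data and function names, not any particular engine’s format.

```python
def dictionary_encode(values):
    """Replace repeated strings with small integer codes plus a lookup table."""
    codes, dictionary = [], {}
    for v in values:
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return codes, dictionary

def build_zone_map(column, block_size):
    """Record (min, max) for each fixed-size block of the column."""
    return [
        (min(column[i:i + block_size]), max(column[i:i + block_size]))
        for i in range(0, len(column), block_size)
    ]

def scan_equal(column, zone_map, block_size, target):
    """Find positions equal to target, skipping blocks whose range excludes it."""
    matches, blocks_scanned = [], 0
    for b, (lo, hi) in enumerate(zone_map):
        if lo <= target <= hi:
            blocks_scanned += 1
            start = b * block_size
            for offset, v in enumerate(column[start:start + block_size]):
                if v == target:
                    matches.append(start + offset)
    return matches, blocks_scanned

colors = ["forest green"] * 4 + ["navy"] * 4 + ["forest green"] * 2
codes, dictionary = dictionary_encode(colors)
zone_map = build_zone_map(codes, block_size=4)
matches, blocks_scanned = scan_equal(codes, zone_map, 4, dictionary["navy"])
print(codes, dictionary)
print(zone_map, matches, blocks_scanned)
```

This shrinks the repetition and speeds up scans, but it does nothing about counting each parent product once, which is the part I’m not sure a column store handles.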