Vertical stores for vertical web search?
by Adam Ferrari
Before the holidays I made it over to MIT for a talk by Michael Stonebraker about his latest startup, Goby.com, a vertical search site focused on leisure and travel. Always up for a Stonebraker talk, I was particularly keen to see this one given the combination of elements – a true pioneer of database management systems discussing his first foray onto the consumer-oriented web. The talk was indeed interesting, but not in the way I anticipated. What I expected to get was systems-level observations informed by the experience of powering a consumer web site. What he actually covered, for the most part, was a discussion of the issues involved in delivering a good vertical search experience.
One highlight: Stonebraker says that in appropriate verticals such as travel, it’s cost effective to deliver a highly differentiated search experience relative to a traditional web search. The economics work out that you can employ offshore development to create semantic data extractors to do high-quality structured aggregation of content distributed across the web.
The conclusion he drew: expect to see a proliferation of high-quality vertical search sites eat away at Google’s search dominance by delivering much deeper, high-quality search experiences in their respective domains. [And here’s Daniel Tunkelang’s take on this topic.]
Very fun stuff to ponder, and it will certainly be interesting to see how it plays out. But as a systems guy I couldn’t help wishing for more technical observations about the DBMS tier. In fact, Stonebraker had quite the contrary to offer, mentioning that the server side of the site is simple and vanilla, based on standard Postgres (no surprise there).
The one hint he did offer that anything about the server side might be challenging is that they’re considering moving to Vertica (even less of a surprise) to better handle their performance needs, and to manage the heterogeneity and sparseness in their data. The argument here, which I’ve heard many times before, is that column-oriented RDBMSs are effective for solving search problems on heterogeneous/sparse data. You can represent all of the entities of interest as one big table, and the column store will handle the storage of sparse data very efficiently given its ability to compress out all of the NULLs (no question about that). Furthermore, the column store lets you power features such as faceted search very efficiently given the amazing scan rates it can provide (again, no question about that).
Continuing the theme of “no surprises,” despite agreeing with some parts of the argument, I think it reaches an incorrect conclusion – namely that with the right underlying implementation, aka a column store, a relational database is the right technology to power a search application.
There’s an important assumption that this argument glosses over, but that causes the approach to become very problematic as data complexity increases. It assumes that the application knows which columns are relevant for a given search. For example, if I search for “hiking trails” on Goby, I’d very much like to be able to refine by characteristics of the trail such as length and difficulty. If I search for “parks,” I’d like to be able to refine by regulations such as whether camping or barbecuing is allowed (if you know me, then you know that the barbecue example is, you guessed it, no surprise). Or more interestingly, if I perform a search that brings me back a combination of outdoor activities including both hiking trails and parks, there are cross-cutting facets that I’d like to present such as whether the location is dog friendly, and what types of activities are available.
Many sites based on traditional RDBMS back-ends attempt to address this need by pre-coordinating the presentation rules. That is, the site owner makes the editorial decision, in advance, that when the user navigates to the hiking category, certain fields should be shown, and in the parks category, a different set.
Already that kind of pre-coordination adds management complexity. But what happens when the user doesn’t navigate cleanly to a specific category, but rather performs an ad-hoc search? How will we know which fields are appropriate?
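To make the pre-coordination problem concrete, here is a minimal sketch in Python. The rule table, field names, and function names are all hypothetical, invented for illustration; nothing here reflects how Goby or any particular site is built. It shows that a category-to-fields lookup answers cleanly when the user browses to a category, but has nothing to say once the results span categories.

```python
# Hypothetical editorial rules: which refinement fields to show per category.
FACETS_BY_CATEGORY = {
    "hiking": ["length", "difficulty"],
    "parks": ["camping_allowed", "barbecue_allowed"],
}


def facets_for_navigation(category):
    """Category browsing: the rule table answers directly."""
    return FACETS_BY_CATEGORY[category]


def facets_for_search(result_categories):
    """Ad hoc search: the rules apply only if every hit shares one category."""
    distinct = set(result_categories)
    if len(distinct) == 1:
        return FACETS_BY_CATEGORY.get(distinct.pop())
    return None  # mixed or empty results: no pre-coordinated rule exists


hiking_fields = facets_for_navigation("hiking")
mixed_fields = facets_for_search(["hiking", "parks", "hiking"])
```

A site could of course bolt on more rules for mixed results, but each such rule is one more editorial decision to maintain.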
The “one big table” model doesn’t accommodate this heterogeneous data requirement. Attempting a “GROUP BY” on every field and seeing which ones come up non-empty simply doesn’t scale if you have a truly wide table. And very wide, sparse data sets are abundant in the wild. For example, many retail catalogs have hundreds of product categories with tens of attributes each, and many parts databases contain hundreds or thousands of part categories each with hundreds of technical attributes. Even a vertical search domain like travel can scale out rapidly as you capture more types of entities, extract more of the attributes about each (which, after all, is a big part of the value-add of vertical search relative to basic web search), and factor in related content such as user ratings and reviews.
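As a rough illustration of the "GROUP BY on every field" approach, the sketch below uses Python's built-in sqlite3 module as a stand-in for the relational back-end (it is a row store, not a column store, so it says nothing about scan speed). The table name, column names, and sample rows are made up. The point is only the shape of the work: one query per column, so the query count grows with table width.

```python
import sqlite3

# Hypothetical "one big table": every entity type shares one sparse schema.
COLUMNS = ["length_miles", "difficulty", "camping_allowed",
           "barbecue_allowed", "dog_friendly"]


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity (name TEXT, kind TEXT, "
        + ", ".join(f"{c} TEXT" for c in COLUMNS) + ")"
    )
    rows = [  # made-up sample data; None models a NULL in the sparse table
        ("Trail A", "trail", "7.5", "hard", None, None, "yes"),
        ("Trail B", "trail", "3.0", "easy", None, None, "no"),
        ("Park C", "park", None, None, "no", "yes", "yes"),
    ]
    conn.executemany("INSERT INTO entity VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def discover_facets(conn, where_sql, params=()):
    """Run one GROUP BY per column and keep the columns that have values.

    The number of queries is len(COLUMNS), i.e. linear in table width.
    """
    facets = {}
    for col in COLUMNS:  # names come from a fixed list, not from user input
        sql = (f"SELECT {col}, COUNT(*) FROM entity "
               f"WHERE ({where_sql}) AND {col} IS NOT NULL GROUP BY {col}")
        counts = dict(conn.execute(sql, params).fetchall())
        if counts:
            facets[col] = counts
    return facets


conn = build_demo_db()
trail_facets = discover_facets(conn, "kind = ?", ("trail",))
all_facets = discover_facets(conn, "1 = 1")
```

With five columns this is harmless. With the hundreds or thousands of attributes mentioned above, every search would fan out into that many aggregate queries just to learn which refinements are worth showing.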
Of course, the final non-surprise for the post is that there is in fact an approach that handles data heterogeneity very effectively, and it has important implications for the systems side of things: namely, the use of a native, semi-structured data model such as XML or RDF. These types of data models allow you to model heterogeneous data very naturally, without the need for a clunky sparse/wide table that masks the true structure of the data from the application.
More importantly, these models let you efficiently deliver an ergonomic user experience. For example, if my records are XML and I want to figure out which top level element names (analogous to column names in the big table) are present, the path expression to do so is trivial. Effectively, I’m able to query the metadata and the data on equal footing. And furthermore, as has been illustrated well by work like the Pathfinder/MonetDB project, semi-structured data is not at all incompatible with a decomposed (i.e., column-like) storage model.
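Here is a small sketch of that "trivial path expression," using Python's standard xml.etree.ElementTree module. The record layout, element names, and sample values are assumptions made up for this example, and a real engine would evaluate the path against an index rather than a parsed string. The path `./record/*` selects every top-level child of every record, and the element names fall out as data.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical search results: each record carries only the fields it has.
RESULTS_XML = """
<results>
  <record>
    <name>Trail A</name>
    <length_miles>7.5</length_miles>
    <difficulty>hard</difficulty>
    <dog_friendly>yes</dog_friendly>
  </record>
  <record>
    <name>Trail B</name>
    <length_miles>3.0</length_miles>
    <difficulty>easy</difficulty>
    <dog_friendly>no</dog_friendly>
  </record>
  <record>
    <name>Park C</name>
    <camping_allowed>no</camping_allowed>
    <barbecue_allowed>yes</barbecue_allowed>
    <dog_friendly>yes</dog_friendly>
  </record>
</results>
"""


def present_fields(root):
    """Count occurrences of each top-level element name across the records."""
    return Counter(child.tag for child in root.findall("./record/*"))


root = ET.fromstring(RESULTS_XML.strip())
fields = present_fields(root)
```

Here `fields` maps each element name that actually occurs in the result set to its count, so cross-cutting fields such as `dog_friendly` surface alongside the type-specific ones without any per-column probing or category rules.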
This observation about the power of semi-structured data to accommodate heterogeneity is nothing new – for example, the Lore project at Stanford in the mid-90s described these concepts. Stonebraker has certainly been very critical of the applicability of semi-structured data in the past (for example, his “What Goes Around” paper leads off the “Red Book” that graces every database student’s shelf). But some of his recent writings hint that he’s become more receptive to semi-structured data approaches.
It was disappointing to see that his detente with the semi-structured world doesn’t extend to actually applying the technology in one of its absolute sweet spots – rich search applications. Goby is clearly going in a good direction, so when they are a premier web destination with crazy traffic and massive growth in data scope, perhaps Mike’s follow-up talk will get to this topic.
on January 31, 2010 at 2:54 pm
[...] the posts so far are nice and meaty. I particularly like Adam’s post about “Vertical stores for vertical web search?“–it’s nice to see read intelligent analysis from someone who understand the [...]