Monash, are there really only three kinds of data?

by Adam Ferrari

Is my data structured, unstructured, or semi-structured? Curt Monash provided yet another take on this never-ending data management question in a blog post earlier this week. This topic has generated tons of discussion over time, but despite this, common perceptions out there seem fairly straightforward. Basically, common wisdom holds that:

  1. Structured data is the stuff in relational databases. Unstructured data is everything else. And,
  2. Relational databases and BI are how you store and analyze structured data, respectively. Search and other more specialized systems are how you work with everything else.

Sure, most people would probably concede that this model is lossy. And the world is evolving – notably, the NoSQL movement has been vigorously calling into question the idea that RDBMSs are always the right tool for working with structured data. But as a first-order approximation, that seems to be how most people think about “flavors” of data.

Curt’s post does a nice job of bringing in some additional wrinkles, such as whether data is machine-generated or human-generated. This question has important bearing on the quantity and quality you can expect in a data set, and how you might approach issues like storage, retention, and analysis. But within the “human-generated” domain, he holds onto the traditional structured versus unstructured distinction. Paraphrasing, the distinctions he draws between the two types of human-generated data are:

  1. Structured data fits into tables, and (with appropriate management) can be viewed as complete and accurate, and
  2. Unstructured data is everything that doesn’t fit into tables. It often deals with communication, opinions, and judgments, and therefore it’s inherently more incomplete and fuzzy.

There’s an excellent and important point here, which is that different data sets do in fact have differing levels of completeness and accuracy. And this is definitely tied to the processes of their creation and management. For example, if I’m running my ERP system and associated business processes well, I would hope to have a complete and accurate view of all of my sales transactions. But on the other end of the spectrum, the discussion forums in my developer network cover whatever topics people felt moved to talk about, and although there’s a ton of valuable and accurate information in there, there are also certainly points where people injected opinions, got stuff wrong, etc.

But there’s an important point where I differ with Curt’s analysis, namely the coupling of notions of structured and unstructured data types with notions of accuracy and completeness, and more importantly with the types of analysis that you might want to do on the data. There are tons of real and important data sets that break the model – ones that are complete and accurate, but that simply don’t fit conveniently into a tabular representation. A good example that I see often comes from manufacturing and supply chain applications where you have “part description records” with lots of well-structured technical attributes (e.g., maintained in a PIM system, augmented with commercial data feeds, etc.). Sure, you can torture this stuff into tables, but given the sparseness, heterogeneity, hierarchies, etc., it’s not easy. You end up with either a very complex schema or a very opaque one (e.g., something like an EAV model). Neither of those options is convenient for someone trying to query and analyze the data in unpredictable, ad-hoc ways.
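
To make the contrast concrete, here is a minimal Python sketch. The part records, attribute names, and helper functions are all hypothetical, invented for illustration rather than taken from any real PIM system. It stores the same three sparse parts as EAV rows and as semi-structured records, then asks the same ad-hoc question of each.

```python
# Hypothetical part records: each part carries only the attributes that apply to it.
parts = [
    {"id": "P1", "type": "resistor", "resistance_ohms": 470, "tolerance_pct": 5},
    {"id": "P2", "type": "bolt", "thread": "M6", "length_mm": 20, "material": "steel"},
    {"id": "P3", "type": "resistor", "resistance_ohms": 1000},
]


def to_eav(records):
    """Flatten records into (entity, attribute, value) rows, as an EAV table would."""
    return [(r["id"], k, v) for r in records for k, v in r.items() if k != "id"]


def query_records(records, **conditions):
    """Ad-hoc filter on semi-structured records: a single pass over the records."""
    return [
        r["id"]
        for r in records
        if all(k in r and pred(r[k]) for k, pred in conditions.items())
    ]


def query_eav(rows, **conditions):
    """Same filter on EAV rows: one pass per condition (a self-join per condition in SQL)."""
    matching = None
    for attr, pred in conditions.items():
        ids = {e for e, a, v in rows if a == attr and pred(v)}
        matching = ids if matching is None else matching & ids
    return sorted(matching or [])


eav_rows = to_eav(parts)

# "Which resistors are under 500 ohms?" expressed as attribute -> predicate.
wanted = {
    "type": lambda v: v == "resistor",
    "resistance_ohms": lambda v: v < 500,
}
from_records = query_records(parts, **wanted)
from_eav = query_eav(eav_rows, **wanted)
```

Both queries return the same part, but the EAV version needs a separate scan and intersection for every attribute in the question, and that cost grows with each condition an analyst adds.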

It’s too limiting to put data sets into a simple taxonomy of structured, unstructured, and semi-structured. As Curt suggests, it is valuable to think about data sets based on how they are produced and managed, looking at characteristics such as their completeness and accuracy. But we need to consider separately how best to store and query any given data set. Let’s not try to force-fit complex, sparse data into a tabular form when a semi-structured model is a much cleaner and simpler fit.

Posted on January 21, 2010 at 11:25 am
In: databases

2 Responses

  1. Written by Dan Barbata
    on March 4, 2010 at 8:18 pm

    Is there any disadvantage you can see in trying to bring structure to unstructured information? By one way of thinking, the more structure you have the better. Wondering if you agree?

  2. Written by aferrari
    on March 5, 2010 at 5:05 pm

    Dan, thanks for the comment. I absolutely agree that there’s a ton of value in enhancing unstructured information by adding more structure! And this can be accomplished in many ways – for example, through text analytics such as entity and sentiment extraction, or through more basic approaches such as joining information from structured and unstructured sources. In fact, many of these techniques combine quite nicely – for example, in an enterprise setting I might pull customer and employee names out of unstructured text fields, and then join on structured data from my CRM or HR databases. But these approaches really work best with a semi-structured data representation as the target, as opposed to a relational store. For example, if I’m doing entity extraction it’s hard to tell in advance which types of entities I will find, and in what quantities. If I can pull structure out of unstructured information, there are many incredibly valuable user interaction features that we can power, like faceted navigation, type-ahead search, structured range filters (e.g., geographic filters if I can pull out location data), data visualizations like tag clouds, and so on. But we need to be targeting the right kind of data model when we take these approaches; otherwise they become limited by schema complexity and commensurate query evaluation slowdowns. -Adam
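
As a rough illustration of the faceted navigation point, here is a small Python sketch. The documents, the entity types, and the `facet_counts` and `refine` helpers are hypothetical, and the "extracted" entities are hard-coded rather than produced by a real text analytics tool. The idea is that each document carries whatever entity types happened to be found, and facets are computed from whatever is there.

```python
from collections import Counter, defaultdict

# Hypothetical output of entity extraction: entity types and counts vary per document.
docs = [
    {"id": 1, "entities": {"customer": ["Acme"], "location": ["Boston"]}},
    {"id": 2, "entities": {"customer": ["Acme", "Globex"]}},
    {"id": 3, "entities": {"employee": ["J. Smith"], "location": ["Boston", "Austin"]}},
]


def facet_counts(documents):
    """Count documents per value for every entity type that appears."""
    facets = defaultdict(Counter)
    for doc in documents:
        for entity_type, values in doc["entities"].items():
            for value in set(values):  # count each document once per value
                facets[entity_type][value] += 1
    return facets


def refine(documents, entity_type, value):
    """Keep only the documents that carry the selected facet value."""
    return [d for d in documents if value in d["entities"].get(entity_type, [])]


facets = facet_counts(docs)
boston_docs = refine(docs, "location", "Boston")
```

No schema change is needed when a new entity type shows up: it simply becomes another key in `facets`.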
