Steve Hoberman

What is Unstructured Data through the Eyes of the Data Professional?

By Steve Hoberman on October 18, 2010

Recently I purchased software to help organize my collection of digital photographs. This software has some neat features, including the ability to assign tags to each photograph. Tags help describe the photograph, such as who is in it and where it was taken. The software even comes with a reporting system, answering questions such as “How many photographs contain the tag ‘Jamie’ (my youngest)?” (The answer, by the way, is over 4,000!)

Being a diehard data modeler, I stared at the screen after assigning the tag ‘Jamie’ to her four thousand and first photograph and wondered: what does the data model behind this software look like? I started sketching boxes and lines and realized I had come to a decision point where I needed to model either the photograph itself or the contents of the photograph. The photograph itself seemed unstructured to me, yet the photograph’s components appeared to have far more structure. I then started asking myself, “What is unstructured data anyway? What is the difference between structured and unstructured?”

A quick search on the Internet turned up many computer-type definitions, such as this one from Wikipedia: “Unstructured data refers to (usually) computerized information that either does not have a data structure or has one that is not easily usable by a computer program.” Definitions like this have little value to us as data analysts, data architects, or data modelers. In fact, I could not find a definition that seemed relevant to the field of data management, so I wrote my own.

Before defining unstructured data, we need to understand the concept of a classword. A classword is the last term in a data element name which defines the high level domain in which the data element belongs. A few examples of classwords are Quantity, Amount, Name, and Code. For example, the data element Gross Sales Amount contains the Amount classword which implies a currency such as US dollars.
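As a rough illustration of this definition, a minimal Python sketch might treat the classword as simply the last word of the data element name. The function name `classword_of` is hypothetical and does not come from any modeling tool; it also assumes terms are separated by spaces.

```python
def classword_of(data_element_name: str) -> str:
    """Return the last term of a data element name, i.e. its classword.

    Assumption: terms in the name are separated by whitespace.
    """
    terms = data_element_name.split()
    if not terms:
        raise ValueError("data element name is empty")
    return terms[-1]


print(classword_of("Gross Sales Amount"))
```

Called on Gross Sales Amount, the sketch returns the final term, Amount, which is the classword described above.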

Some classwords are simple and some are complex in nature. Simple classwords can only be broken down further through the process of normalization. Simple classwords include Amount, Code, Date, Identifier, Indicator, Name, Number, Percent, and Quantity. For example, Customer Name can only be broken down further by normalizing it into Customer First Name and Customer Last Name, and both of these stay within the same classword of ‘Name’.

Complex classwords include Text and Object. Examples of data elements containing the Text classword are Order Comments Text and Email Body Text. Examples of data elements containing the Object classword (which covers photographs, music, PDF files, and voice conversations) are Driver License Photo and Book Licensing Agreement. Complex classwords can always be broken down into smaller pieces, such as completely new data elements with simple classwords. The data element Photograph Object, for example, can be broken down into Photograph Taken City Name, Person In Photograph Name, and Photograph Caption Text. We can go even further and include simple data elements such as Dots Per Inch Count, Film Speed, and Photograph File Size.
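One way to picture this decomposition is a small Python sketch in which the photograph is represented by the data elements named above. The class `PhotographObject`, the helper `photos_with_person`, and all sample values are hypothetical and made up for illustration; this is not the data model behind any actual photo software.

```python
from dataclasses import dataclass, field


@dataclass
class PhotographObject:
    """Hypothetical decomposition of Photograph Object into smaller elements."""
    photograph_taken_city_name: str                 # classword: Name
    person_in_photograph_names: list[str] = field(default_factory=list)  # Name
    photograph_caption_text: str = ""               # classword: Text


def photos_with_person(photos, person_name):
    """Return the photographs that list the given person."""
    return [p for p in photos if person_name in p.person_in_photograph_names]


# Made-up sample data, for illustration only.
photos = [
    PhotographObject("Springfield", ["Jamie"], "Jamie at the park"),
    PhotographObject("Springfield", ["Alex"], "Alex on a bike"),
    PhotographObject("Lakeside", ["Jamie", "Alex"], "Family picnic"),
]

print(len(photos_with_person(photos, "Jamie")))
```

Once the components are pulled out as their own data elements, a question like “how many photographs contain Jamie?” becomes an ordinary lookup on a Name element rather than something that requires inspecting the image itself.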

For me as the data modeler, the distinction between structured and unstructured data boils down to the distinction between simple and complex classwords. To be more specific: Structured data is any data which can be referred to with a simple classword. Unstructured data is any data which can be referred to with the complex classwords of Text or Object. Customer Last Name is structured. Order Comments Text is unstructured.
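The rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `classify` is hypothetical, the classword lists are the ones given in this post, and names whose last term is in neither list are reported as "unknown" rather than guessed at.

```python
# Classword lists as given in this post.
SIMPLE_CLASSWORDS = {
    "Amount", "Code", "Date", "Identifier", "Indicator",
    "Name", "Number", "Percent", "Quantity",
}
COMPLEX_CLASSWORDS = {"Text", "Object"}


def classify(data_element_name: str) -> str:
    """Classify a data element by the last term (classword) of its name."""
    terms = data_element_name.split()
    if not terms:
        return "unknown"
    classword = terms[-1]
    if classword in SIMPLE_CLASSWORDS:
        return "structured"
    if classword in COMPLEX_CLASSWORDS:
        return "unstructured"
    # Assumption: anything outside both lists is left unclassified.
    return "unknown"


for name in ["Customer Last Name", "Order Comments Text", "Photograph Object"]:
    print(name, "->", classify(name))
```

The sketch depends on data element names actually ending in a classword, so a name such as Driver License Photo would come back as "unknown" until it is renamed or the lists are extended.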

You’re probably wondering, why even distinguish structured from unstructured? If your job function includes analyzing, architecting, modeling, or designing information, unstructured data affects how you do your job.

Stay tuned for my next blog, where I will explain how unstructured data impacts us as data professionals. Until the next blog…


About the Author

Steve Hoberman is one of the world’s most well-known data modeling gurus. He understands the human side of data modeling and has evangelized “next generation” techniques. Steve taught his first data modeling class in 1992 and has educated more than 10,000 people about data modeling and business intelligence techniques since then.

Bob Conway
November 2, 2010

Steve, I believe you are right to separate the ‘content’ from the ‘structure’ of so-called unstructured data. A pie chart in a PPT certainly qualifies as unstructured data. However, simply tagging bibliographic data such as the name of the PPT file, the author of the presentation, and the slide number renders marginal value for information retrieval when the pie chart content is ‘about’ 4Q2010 Sales Forecast by Product. Abstracting what the unstructured data’s content is about is very difficult, but it is the most useful step for defining its semantics. Just like an (old) library card catalog, bibliographic data does not tell us the ‘content’ of the books but rather provides an algorithm for retrieval. The semantic value of the book only comes from the heuristic of cracking the pages and reading it. Much of companies’ unstructured data content (PPTs, XLSs, DOCs, PDFs) is far downstream from their source-of-record systems, and its semantics have been dramatically altered by heuristic interpretation. Integrating unstructured data content with traditional relational data content may be tilting at windmills. For the near future, the best we may be able to do is standardize and ‘structure’ the bibliographic data so that users can narrow their retrieval and apply heuristic interpretation.

Steve Hoberman
November 8, 2010

Bob, you definitely suggest a very practical step toward deriving some of the value from unstructured data. I can see that after implementing your suggestion, one can take this organized set of information, start cracking the pages (I really like this phrase!), and get more out of this analytical goldmine.

Thanks for reading, and thanks for your comments!

Steve
