Steve Hoberman

How does unstructured data impact us?

By Steve Hoberman on November 12, 2010

Welcome back. In my last blog posting I offered my definition of ‘unstructured data’, and in this posting I will explain how unstructured data impacts us, the data professionals. In general, it means more work for us, but also more challenging work and therefore more fun.

Recall that in my last blog posting I drew the distinction between structured and unstructured data as a distinction between simple and compound classwords. Structured data is any data that can be referred to with a simple classword. Unstructured data is any data that can be referred to with the compound classwords Text or Object. Customer Last Name is structured. Order Comments Text is unstructured.

If you have the word ‘analyst’ somewhere in your job title, or you perform analysis work such as requirements gathering and high-level data modeling, there is an additional question you will need to ask when you encounter an unstructured element: Do you need to see this ‘thing’ or something about this ‘thing’? In other words, does the user need to see the unstructured data in its raw form, or is there something within (or derived from) this unstructured data that needs to be seen?

Let’s take an example. Let’s say there is a photograph. The user may need to see the photograph itself or something about the photograph, such as the number of people in the photograph, the number of photographs that contain ‘Bob Jones’, the speed of the film, the lens aperture opening, the dots per inch, or where the photo was taken. You get the idea.

Let’s take another example – email. Would the user want to see the email message in its entirety, or just something about the message, like does it contain certain action words like ‘steal’, ‘embezzle’, or ‘borrow’? Or certain product names? Or certain emotional words like ‘love’ or ‘hate’?
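To make the ‘something about the message’ case concrete, here is a minimal sketch that derives a few structured counts from an email body. It assumes plain exact-word matching (so ‘stealing’ would not match ‘steal’), and the names WATCH_WORDS and derive_email_facts are hypothetical, made up for this illustration rather than taken from any tool.

```python
import re
from collections import Counter

# Hypothetical watch lists, built from the example words mentioned above.
WATCH_WORDS = {
    "action": {"steal", "embezzle", "borrow"},
    "emotion": {"love", "hate"},
}


def derive_email_facts(body_text: str) -> dict[str, int]:
    """Count how many watched words of each category occur in the text."""
    # Lower-case the text and split it into simple word tokens.
    words = re.findall(r"[a-z']+", body_text.lower())
    counts = Counter(words)
    return {
        category: sum(counts[word] for word in watch)
        for category, watch in WATCH_WORDS.items()
    }


facts = derive_email_facts(
    "I hate to ask, but may I borrow the report? I would love that."
)
print(facts)  # {'action': 1, 'emotion': 2}
```

The counts are structured data that can be stored and queried on their own, whether or not the full message text is kept.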

If you have ‘modeler’ somewhere in your title, also be prepared for more work. The extra work will take the form of crafting a creative design based on the output from the analyst. For example, with the photograph example, assume the user needs to know the number of people in the photograph. The modeler needs to decide whether both the photograph and Photograph Person Count should be modeled, or just the Photograph Person Count, or maybe some type of generic photograph demographics structure to allow for future expansion. With the email message, should we store Email Body Text or just counts of certain words or tags?
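One way to picture the photograph choices is the sketch below, which uses Python's built-in sqlite3 module. It is only an illustration under my own assumptions: the table and column names (photograph, photograph_attribute, and so on) and the sample values are invented, and a real design would depend on the requirements the analyst gathered.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Choice 1: a dedicated column for the one derived fact we know we need.
    CREATE TABLE photograph (
        photograph_id           INTEGER PRIMARY KEY,
        photograph_person_count INTEGER
    );

    -- Choice 2: a generic name/value structure that leaves room for
    -- future facts about a photograph without changing the schema.
    CREATE TABLE photograph_attribute (
        photograph_id   INTEGER NOT NULL REFERENCES photograph (photograph_id),
        attribute_name  TEXT NOT NULL,
        attribute_value TEXT NOT NULL,
        PRIMARY KEY (photograph_id, attribute_name)
    );
""")

# Made-up sample data for a single photograph.
conn.execute("INSERT INTO photograph VALUES (1, 3)")
conn.executemany(
    "INSERT INTO photograph_attribute VALUES (?, ?, ?)",
    [(1, "Location Name", "Paris"), (1, "Dots Per Inch", "300")],
)

person_count = conn.execute(
    "SELECT photograph_person_count FROM photograph WHERE photograph_id = 1"
).fetchone()[0]

attributes = dict(
    conn.execute(
        "SELECT attribute_name, attribute_value "
        "FROM photograph_attribute WHERE photograph_id = 1"
    )
)
```

The dedicated column is easier to type, constrain, and query. The generic structure trades that away for flexibility, since every value is stored as text and new attribute names can appear without a schema change.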

I think if you have ‘modeler’ somewhere in your title, you will also need to become more familiar with taxonomies, knowing which ones are out there in the industry or in your company that you can reuse, and perhaps modeling some new ones from scratch. These taxonomies will help bring order to the unstructured data.

If you have the word ‘architect’ somewhere in your job title, you may need to prepare the architecture you work within, such as the data warehouse environment, to accommodate additional types of metadata. What formats and best practices should be used to capture photographs, music, video, and unformatted text? Are there text analytic tools we should research? Is it a best practice to store the photograph in a file system and point to it via a hyperlink, or to store the photograph directly in the database? Should we even allow JPEGs, PDFs, etc., in our data warehouse? These questions will need to be answered. I had an opportunity to preview Bill Inmon’s most recent book, Building the Unstructured Data Warehouse, and he covers quite a lot of things to think about when analyzing text, from architecting the data warehouse down to indexing strategies. I picked up some important ideas from this book, especially on indexing. The architect (or modeler in some cases) will need to come up with guidelines on which type of index (or combination of indexes) is appropriate for each situation.
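The file system versus database question can be sketched in a few lines, again with Python's built-in sqlite3 module. This is a toy comparison under my own assumptions, not a recommendation: the table names are hypothetical, the ‘photograph’ is just a handful of stand-in bytes, and a plain file path stands in for the hyperlink.

```python
import sqlite3
import tempfile
from pathlib import Path

# Stand-in bytes for illustration only; this is not a real image file.
photo_bytes = b"stand-in bytes for a photograph"

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option 1: the database holds only a pointer to the file.
    CREATE TABLE photograph_by_reference (
        photograph_id INTEGER PRIMARY KEY,
        file_location TEXT NOT NULL
    );

    -- Option 2: the database holds the photograph itself.
    CREATE TABLE photograph_in_database (
        photograph_id     INTEGER PRIMARY KEY,
        photograph_object BLOB NOT NULL
    );
""")

# Option 1: write the file, store its location, then follow the pointer.
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "photo_1.jpg"
    path.write_bytes(photo_bytes)
    conn.execute(
        "INSERT INTO photograph_by_reference VALUES (?, ?)", (1, str(path))
    )
    location = conn.execute(
        "SELECT file_location FROM photograph_by_reference "
        "WHERE photograph_id = 1"
    ).fetchone()[0]
    loaded_by_reference = Path(location).read_bytes()

# Option 2: store and read the bytes directly as a BLOB.
conn.execute("INSERT INTO photograph_in_database VALUES (?, ?)", (1, photo_bytes))
loaded_from_database = conn.execute(
    "SELECT photograph_object FROM photograph_in_database "
    "WHERE photograph_id = 1"
).fetchone()[0]
```

Note that in the first option the pointer is only as good as the file behind it. Once the temporary folder is gone, the row still exists but the photograph does not, which is one of the trade-offs the architect has to weigh.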

So quite a lot of new work and opportunities have arrived, or will arrive shortly. Stay tuned for my next blog posting, where I will discuss how to model XML…


About the Author

Steve Hoberman is one of the world’s most well-known data modeling gurus. He understands the human side of data modeling and has evangelized “next generation” techniques. Steve taught his first data modeling class in 1992 and has educated more than 10,000 people about data modeling and business intelligence techniques since then.

Tom Bilcze
November 17, 2010

Another great post! The topic of modeling unstructured data has been around for some time. It’s just that people never thought of it as unstructured. I like your distinction of it most likely occurring where the class word is text or object.

The example that comes to mind for me is handling freeform comments. So many web apps capture comments or descriptions. Regulations, auditability and better customer experiences all benefit from the implementation of these large blocks of unstructured data. The challenge arises when business users say the business requirement is to find a specific word (hate, love, embezzle) in all comments and descriptions and return the matches in a specific manner. If your company is of any size and the database is large, as it is for most mission-critical apps, the query will either blow up or go out for a smoke break and maybe return later.

Finding the creative solution to the above scenario takes some modeling effort and most likely even more application design and implementation time. Associating keywords with the unstructured data eases the pain and is a typical solution I have seen. I really wonder if technology needs to catch up to satisfy the need for structuring unstructured data in the database. In other words, how can DB2, Oracle or other DBMSs query unstructured data efficiently without much additional design intervention?

Steve Hoberman
November 19, 2010

Great point Tom. The DBMS world still seems quite distant from the text analytics world. The good news is that more and more products that search for specific words and do sophisticated text analysis are starting to appear, so hopefully these tools and features will at some point make their way to the relational database world.

Sampath Kumar
December 2, 2010

Great one. Even though we structure data as much as possible, unstructured data is still growing every day in every form, and we as data modelers need to accept that it cannot be avoided and that it should be modeled very carefully.
