Opetarational Data, Hadoop and New Modeling
By William McKnight on March 7, 2011
The top end of data scale is growing at unprecedented rates for many companies. Google, for example, generates petabytes a day! Most of that data is relatively unimportant on its own, but each bit contributes to multiple aggregations that enhance profiles and otherwise contribute to operations. Each bit may also be a ‘gem’ that should drive a business process or be of interest to a batch process. And finally, just finding a way to store the data allows for future processing of that data, should it become necessary. If any of the data is thrown away, that’s value lost to Google. It’s operational petabytes, or opetarational data.
With high-profile acquisitions, partnerships and tool deployments, Hadoop is gaining some legs in information management. MapReduce does its parallel processing. Google inspired Hadoop, which has since become the Apache Hadoop it is today. Hadoop is not a data warehouse replacement. Its data is destined for a different workload, summaries of which may be loaded into a data warehouse for a data warehouse class of workload.
It’s a new scale of problem that requires new solutions.
Hadoop is one of many categories of solutions attempting to address these problems. No one has a (working) crystal ball, and indeed Hadoop has many detractors. I’m not making predictions today, just noting that many are experimenting with and committing to Hadoop as a solution.
Having explained this concept several times recently, it is clear to me that, for many, the need is not at this scale yet. Massively parallel RDBMSs manage the interesting data today and will for the foreseeable future. This class will also surely scale its capabilities. The “hyper-parallelism” that many data warehouse appliances deploy is actually a similar concept to Hadoop, only Hadoop uses hundreds to thousands of computers for its parallelism. The appliance process in which the SMP host provides a final aggregation and any required merge sort is similar to MapReduce processing (Netezza, for example, is interestingly offering an integration solution with Hadoop). Feeds to a data warehouse are also a MapReduce function. Google runs about 1,000 MapReduce jobs per day. And sure, “hardware is cheap”, but the data increase in Fortune 500 organizations is still sending hardware budgets north at a rate that is drawing increasing scrutiny.
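The similarity between the appliance pattern and MapReduce can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names node_partial and host_merge and the sample data slices are hypothetical, and nothing here reflects how Netezza or Hadoop is actually implemented. Each "node" aggregates its own slice and returns sorted partial results; the "host" then merge-sorts the partials and produces the final aggregation.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def node_partial(rows):
    """Work done on one worker node: count locally, return sorted (key, count) pairs."""
    counts = Counter(rows)
    return sorted(counts.items())


def host_merge(partials):
    """Work done on the host: merge the sorted partials, then sum counts per key."""
    merged = heapq.merge(*partials, key=itemgetter(0))
    return [
        (key, sum(count for _, count in group))
        for key, group in groupby(merged, key=itemgetter(0))
    ]


# Hypothetical data slices, one per node (assumed for illustration only).
slices = [
    ["east", "west", "east"],
    ["west", "north"],
    ["east", "north", "north"],
]

totals = host_merge([node_partial(s) for s in slices])
print(totals)
```

The point of the design is that the expensive work (scanning and counting) happens in parallel on the nodes, while the host only has to combine small, already-sorted partial results.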
Hadoop has high-profile implementations at Google, Yahoo, Twitter and Facebook – linchpins of the emerging economy. I’m looking into Hadoop for multiple Fortune 50 companies. Disney is public with its Hadoop implementation. These companies may be able to meet their high-end needs with an RDBMS but want to store and process all their data and do it cheaply. They toss away data regularly. They also realize that only a small percentage of RDBMS functionality is of interest on the most granular data.
Could the need for something like Hadoop be coming to your shop, and what are the implications for data modeling? I believe the need is coming as structured, human-generated data gets well under management and as competitive advantage increasingly demands near real-time access to machine-generated data, sensor data, clickstream data and unstructured data like logs and websites. Many will join the handful of companies today that are doing serious processing beyond a traditional RDBMS.
Data modeling has never been more important, and it will continue to be important, but the physical data modeling stage is becoming increasingly important. This is because, in order to capture opetarational data, not only must new approaches like Hadoop be used for the data capture, but the modeling must also accommodate the speed requirement of the loading. And, sad to say, the speed of the modeling effort must be considered as well. Hadoop data may never be queried! So it may be more effective overall to store the data in something close to its original form and allow Hadoop occasional access-on-demand to the data.
It will be a re-balancing of data modeling energies from the business-focused structuring of today. Hadoop captures data in a state that many modelers today would consider very raw. It’s a different mindset when you consider that the consumption profile is not necessarily a high number of user queries with a modern business intelligence tool (although many capabilities are being built to give those tools access to Hadoop) and the ideal resting state of that model is not necessarily dimensional.
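A minimal sketch of that idea, storing records untouched at load time and applying structure only on the rare occasion someone asks a question, might look like the following. Everything in it is an assumption for illustration: the in-memory list raw_store stands in for files landed in Hadoop, JSON is an arbitrary choice of raw format, and capture and read_on_demand are hypothetical names.

```python
import json

# Stands in for raw files landed in Hadoop (an assumption for illustration).
raw_store = []


def capture(line):
    """Load path: append the record as-is, with no parsing or modeling."""
    raw_store.append(line)


def read_on_demand(field):
    """Access path: parse each raw record only when a field is requested."""
    for line in raw_store:
        record = json.loads(line)
        if field in record:
            yield record[field]


capture('{"sensor": "a1", "temp": 21.5}')
capture('{"sensor": "b2", "humidity": 40}')

temps = list(read_on_demand("temp"))
print(temps)
```

The load path stays fast because it does no modeling work at all; the cost of interpreting the data is paid only if and when the data is actually read.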
There are new data access tools like Hive – an SQL-like tool that generates MapReduce code – and Pig Latin. I remember when Pig Latin was sounding “ib” in front of every vowel, but that’s a different Pig Latin. This Pig Latin was developed by Yahoo! to address data transformation without having to write MapReduce code, which is based on Java.
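To give a feel for what such tools spare the developer from writing, here is a rough sketch of how a simple SQL-like query (roughly SELECT page, COUNT(*) FROM clicks GROUP BY page) breaks down into map, shuffle and reduce steps. It is plain Python rather than anything Hive or Pig actually produces, and the clicks records and function names are hypothetical.

```python
from collections import defaultdict


def map_phase(record):
    """Emit one (key, value) pair per input record: the page and a count of 1."""
    yield record["page"], 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


# Hypothetical clickstream records (assumed for illustration only).
clicks = [{"page": "/home"}, {"page": "/cart"}, {"page": "/home"}]

pairs = [pair for record in clicks for pair in map_phase(record)]
result = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
print(result)
```

A one-line declarative query replaces all three hand-written steps, which is the appeal of a SQL-like layer over MapReduce.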
Information professionals face an interesting set of new challenges as data needs escalate and a real-time business utilizing all information possible becomes a requirement. Is Hadoop disruptive? Time will tell. For now, it’s a developing solution to opetarational data, and it is bringing about a change in modeling approaches.
About the Author
William functions as Strategist, Lead Enterprise Information Architect, and Program Manager for complex, high-volume full life-cycle implementations worldwide utilizing the disciplines of data warehousing, master data management, business intelligence, data quality and operational business intelligence.
This is a fascinating article. If you are ever in the New York area, please let me know. We would love to have you speak about it at NYEMUG, the New York ERwin Data Modeling User Group. We are one of the largest local ERwin user groups and would like to add you to the list of great speakers we have had in the past, like Steve Hoberman, Graeme Simsion, Barbara Von Halle, and many others.
May 13, 2011