William McKnight

MDM and Big Data, from Batch to Real-Time

By William McKnight on January 22, 2012

“Big data” describes data that has the potential to grow much more rapidly than its smaller brethren. While structured, alphanumeric data can also be “big”, it is the unstructured sensor, web and social data for which cost-effective storage and retrieval methods have emerged, and those methods may or may not be database oriented.

When taking any new data under management, it is essential to get the most out of that data. The value of all data rises sharply when it can be coupled with corporate master data: customers, products, sites, parts, etc. Master data management (MDM), the discipline of building and delivering these masters to the enterprise, is one of the best means to compound the value of any data, including big data. MDM is fundamental support for big data.

Scans (e.g., MapReduce jobs) through big data typically have a very high number of records to triage. As such, they have little wiggle room for performance drags, yet many need master data joined to their high-volume transaction data. Analyzing transactions is hardly complete without understanding the characteristics of the people and products behind the transactions. Yet big data analysis so far mostly ignores these characteristics and only learns about product movement, how unknown people traverse a site, or whatever else happens to be in the log itself.
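To make the join concrete, here is a minimal sketch of a map-and-reduce style pass that looks up master data for each transaction. It runs in plain Python rather than on a real cluster, and the record layouts, the `CUSTOMER_MASTER` table and the function names are all hypothetical, chosen only for illustration. Keeping the master as an in-memory lookup is one common way to avoid adding a performance drag to the scan, assuming the master is small enough to fit.

```python
from collections import defaultdict

# Hypothetical master customer records, keyed by customer id.
CUSTOMER_MASTER = {
    "c1": {"segment": "premium", "region": "east"},
    "c2": {"segment": "standard", "region": "west"},
}

# Hypothetical high-volume transaction records (e.g., parsed log lines).
TRANSACTIONS = [
    {"customer_id": "c1", "amount": 120.0},
    {"customer_id": "c2", "amount": 35.0},
    {"customer_id": "c1", "amount": 80.0},
    {"customer_id": "c9", "amount": 10.0},  # no master record exists
]


def map_enrich(txn, master):
    """Map step: emit (segment, amount) using an in-memory master lookup."""
    record = master.get(txn["customer_id"])
    segment = record["segment"] if record else "unknown"
    return segment, txn["amount"]


def reduce_totals(pairs):
    """Reduce step: sum the amounts for each master segment."""
    totals = defaultdict(float)
    for segment, amount in pairs:
        totals[segment] += amount
    return dict(totals)


totals_by_segment = reduce_totals(
    map_enrich(txn, CUSTOMER_MASTER) for txn in TRANSACTIONS
)
```

Without the master lookup, the same scan could only total amounts by an opaque customer id. With it, the totals are grouped by a characteristic of the customer, and transactions with no master record surface as "unknown".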

Joining this data to less than fully governed, high-quality “master” data carries two risks. The first is a high degree of change in what the organization believes to be the master. The second is change in the data itself, since the originator did not take the care to make it an enterprise master. It’s just data you decided to rely on.

MDM systems are built knowing they will contain the master. There are capabilities for building the master data, which involve both automated and human input at appropriate points. There are capabilities for cleansing the data to an enterprise standard. There are capabilities for managing the data securely and for distributing the data to multiple points in real-time.

Also, regardless of how big data is stored, it is often processed in real-time, as it streams. Processing streaming data needs MDM data even more than big data that lands in HDFS. MDM data illuminates streaming data.

Take fraud. You can tell if a transaction is fraudulent in two ways. One is that the transaction attempts something that is out-of-bounds regardless of who is doing it, like executing a buy and sell on the same security at the same time, which gives the trader an unrealized loss. The other, and trickier, way is to compare the trade to the acceptable profile for the trader.
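The two checks could be sketched roughly as follows. The `Trade` and `TraderProfile` shapes, the field names and the profile limits are assumptions made up for this illustration, not a real fraud model. The first function needs only the transactions themselves, while the second cannot run without a profile supplied from outside the transaction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trade:
    trader_id: str
    security: str
    side: str        # "buy" or "sell"
    quantity: int
    timestamp: int   # seconds since some epoch (hypothetical unit)


@dataclass(frozen=True)
class TraderProfile:
    # Hypothetical "acceptable profile" attributes held as master data.
    max_quantity: int
    allowed_securities: frozenset


def is_out_of_bounds(trade, concurrent_trades):
    """Check 1: a simultaneous opposite-side trade on the same security."""
    return any(
        other.trader_id == trade.trader_id
        and other.security == trade.security
        and other.timestamp == trade.timestamp
        and other.side != trade.side
        for other in concurrent_trades
    )


def violates_profile(trade, profile):
    """Check 2: the trade falls outside the trader's acceptable profile."""
    return (
        trade.security not in profile.allowed_securities
        or trade.quantity > profile.max_quantity
    )


buy = Trade("t1", "ACME", "buy", 100, 1000)
sell = Trade("t1", "ACME", "sell", 100, 1000)
large = Trade("t1", "ACME", "buy", 5000, 2000)
profile = TraderProfile(max_quantity=1000, allowed_securities=frozenset({"ACME"}))

same_time_flag = is_out_of_bounds(buy, [sell])     # flagged by check 1
profile_flag = violates_profile(large, profile)    # flagged by check 2
```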

This profile obviously does not reside in a single transaction itself.  MDM is the discipline that builds that profile.  Though not ostensibly for transaction data, it is the accumulation of transaction data that builds up the interesting part of the customer profile.  MDM should receive the transaction (though not store it), gain intelligence and build a much more interesting customer (and other) profile from transaction data.  For example, MDM should contain spend-to-date and lifetime value metrics, which can only be derived from transactional data analysis.
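As a rough sketch of "receive the transaction but do not store it", the class below folds each transaction into running profile metrics and then discards it. The class name, fields and the choice of metrics are hypothetical. Only spend-to-date and a transaction count are kept here, since a lifetime value calculation would depend on a model the article does not specify.

```python
class CustomerProfile:
    """Hypothetical master profile that accumulates metrics from transactions."""

    def __init__(self, customer_id):
        self.customer_id = customer_id
        self.spend_to_date = 0.0
        self.transaction_count = 0

    def apply(self, amount):
        """Fold one transaction amount into the profile; the transaction is not kept."""
        self.spend_to_date += amount
        self.transaction_count += 1

    @property
    def average_spend(self):
        """Average amount per transaction, or 0.0 before any transaction arrives."""
        if self.transaction_count == 0:
            return 0.0
        return self.spend_to_date / self.transaction_count


profile = CustomerProfile("c1")
for amount in (50.0, 150.0, 100.0):
    profile.apply(amount)
```

Because the profile is updated on every `apply` call, whatever reads it next sees metrics that already include the latest transaction.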

As the profile updates in real-time based on transactions, MDM would then have the most up-to-date profile available to streaming data (and to big data for the batch MapReduce jobs). Only then can the stream effectively determine if the transaction is fraudulent, what the next-best-offer should be, etc.

This mentality holds true for big data in HDFS too. Although the data is processed in batch, outliers only become evident when the transaction is triangulated with master data. If I’m collecting sensor activity on a product and the forklift that transported it at the warehouse, did the forklift take the most efficient route with the product? Each product would have a different “best path” depending on its characteristics, the characteristics of the order, and the dynamics of what is going on simultaneously, all of which may be dynamically learned by the system processing big data.
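The forklift question can be illustrated with a small batch check. Everything here is a simplifying assumption: the "best path" is reduced to a single expected distance per product held in a hypothetical `PRODUCT_MASTER` table, and the 20% tolerance is arbitrary. A real system would likely learn these values dynamically, as described above.

```python
# Hypothetical product master: expected "best path" distance in meters.
PRODUCT_MASTER = {
    "p1": {"best_path_m": 100.0},
    "p2": {"best_path_m": 250.0},
}

# Hypothetical sensor records collected in batch (one per forklift trip).
SENSOR_TRIPS = [
    {"product_id": "p1", "forklift_id": "f1", "distance_m": 105.0},
    {"product_id": "p1", "forklift_id": "f2", "distance_m": 150.0},
    {"product_id": "p2", "forklift_id": "f1", "distance_m": 260.0},
    {"product_id": "p2", "forklift_id": "f3", "distance_m": 310.0},
]


def find_route_outliers(trips, master, tolerance=0.2):
    """Return trips whose distance exceeds the product's best path by more than tolerance."""
    outliers = []
    for trip in trips:
        record = master.get(trip["product_id"])
        if record is None:
            continue  # no master record, so there is nothing to compare against
        limit = record["best_path_m"] * (1 + tolerance)
        if trip["distance_m"] > limit:
            outliers.append(trip)
    return outliers


outliers = find_route_outliers(SENSOR_TRIPS, PRODUCT_MASTER)
outlier_forklifts = [trip["forklift_id"] for trip in outliers]
```

The 150-meter trip is an outlier for product p1 but would be unremarkable for p2, which is why the sensor data alone cannot answer the question.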

The masters that correspond to this activity are not part of the HDFS data.  They are part of MDM.

Whether for batch or real-time purposes, MDM is the glue that holds the enterprise systems together.  It allows for full value to be gained from enterprise information projects that no longer have to add in the effort and risk of building the masters that they need.   MDM can be discretely managed in support of the applications that will primarily use big data and streaming data.


About the Author

William functions as Strategist, Lead Enterprise Information Architect, and Program Manager for complex, high-volume full life-cycle implementations worldwide utilizing the disciplines of data warehousing, master data management, business intelligence, data quality and operational business intelligence.

Jim Hare
January 23, 2012

William, I couldn’t agree more! Combining MDM with Big Data is key not only to compounding the value of all the data (streaming, unstructured, structured) but also to increasing the quality of information, especially for operational decision-making. Poor data quality can cause false positives, undetected events, and missed opportunities. MDM provides the foundational capability for validating Big Data against trusted sources, resulting in improved decisions and significantly better outcomes.
