Not Documenting End-to-End Data Lineage Is Risky Business – Understanding your data’s origins is key to successful data governance.
Not everyone understands what end-to-end data lineage is or why it is important. In a previous blog, I explained that data lineage is basically the history of data, including a data set’s origin, characteristics, quality and movement over time.
This information is critical to regulatory compliance, change management and data governance, not to mention delivering an optimal customer experience. But given the volume, velocity and variety of data (the three Vs of data) we generate today, producing and keeping up with end-to-end data lineage is complex and time-consuming.
Yet given this era of digital transformation and fierce competition, understanding what data you have, where it came from, how it’s changed since creation or acquisition, and whether it poses any risks is paramount to optimizing its value. Furthermore, faulty decision-making based on inconsistent analytics and inaccurate reporting can cost millions.
End-to-end data lineage explains how information flows into, across and outside an organization. Knowing how information was created, along with its origin and quality, may have greater value than a given data set’s current state.
For example, data lineage provides a way to determine which downstream applications and processes are affected by a change in data expectations and helps in planning for application updates.
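To make that impact analysis concrete, here is a minimal sketch that assumes lineage has already been captured as a simple "feeds into" map. The asset names and the `downstream` helper are hypothetical, chosen for illustration; a real lineage tool would hold this graph in its own metadata store.

```python
from collections import deque

# Hypothetical lineage edges: each key feeds the assets listed as its value.
LINEAGE = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["warehouse.customer_dim"],
    "staging.orders": ["warehouse.sales_fact"],
    "warehouse.customer_dim": ["report.churn_kpi", "report.revenue_kpi"],
    "warehouse.sales_fact": ["report.revenue_kpi"],
}


def downstream(node, edges):
    """Return every asset reachable from `node`, in breadth-first order."""
    visited = {node}
    impacted = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for child in edges.get(current, []):
            if child not in visited:
                visited.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


# Which tables and reports need review if the CRM customer feed changes?
impacted = downstream("crm.customers", LINEAGE)
print(impacted)
```

The traversal lists each affected asset once, even when it is reachable through several paths, which is what a change-planning checklist needs.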
As I mentioned above, the three Vs of data and the integration of systems make it difficult to understand the resulting data web, much less capture a simple visual of that flow. Yet a consistent view of data and how it flows is paramount to the success of enterprise data governance and any data-driven initiative.
Whether you need to drill down for a granular view of a particular data set or create a high-level summary to describe a particular system and the data it relies on, end-to-end data lineage must be documented and tracked, with an emphasis on the dynamics of data processing and movement as opposed to data structures. Data lineage helps answer questions about the origin of data in key performance indicator (KPI) reports.
Why do so many organizations struggle with end-to-end data lineage?
The struggle is real for a number of reasons. At the top of the list, organizations are dealing with more data than ever before using systems that weren’t designed to communicate effectively with one another.
Next, their IT and business stakeholders have a difficult time collaborating. And many organizations have relied mostly on manual processes, if they have attempted data lineage documentation at all.
The risks of ignoring end-to-end data lineage are just too great. Let’s look at some of those consequences:
Effectively managing business operations is a key factor in success, especially for organizations that are in the midst of digital transformation. Failures in business processes attributed to data errors can be a big problem.
For example, in a typical business scenario where an incorrect data set is discovered within a report, it can take a team days or sometimes weeks to find the source of the error, derailing the project and costing time and money.
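With documented lineage, that search becomes a lookup rather than an investigation. The sketch below walks a hypothetical "feeds into" map backwards from a suspect report to the source systems behind it. All asset names and both helper functions are assumptions made for illustration, not part of any particular product.

```python
# Hypothetical lineage edges: each key feeds the assets listed as its value.
FEEDS = {
    "crm.customers": ["staging.customers"],
    "erp.orders": ["staging.orders"],
    "staging.customers": ["warehouse.customer_dim"],
    "staging.orders": ["warehouse.sales_fact"],
    "warehouse.customer_dim": ["report.churn_kpi", "report.revenue_kpi"],
    "warehouse.sales_fact": ["report.revenue_kpi"],
}


def upstream(target, edges):
    """Return the set of all assets that directly or indirectly feed `target`."""
    # Invert the "feeds into" map so we can walk from a report back to its inputs.
    parents = {}
    for source, destinations in edges.items():
        for destination in destinations:
            parents.setdefault(destination, []).append(source)

    found = set()
    stack = [target]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


def original_sources(target, edges):
    """Return upstream assets that nothing else feeds, i.e. the systems of origin."""
    fed_assets = {d for destinations in edges.values() for d in destinations}
    return sorted(a for a in upstream(target, edges) if a not in fed_assets)


# Where could a bad number in the revenue KPI report have come from?
suspects = original_sources("report.revenue_kpi", FEEDS)
print(suspects)
```

Every intermediate table returned by `upstream` is also a candidate checkpoint, so a team can test each hop in turn instead of searching the whole estate.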
The business glossary must reflect the actual environment, i.e., be refreshed and kept in sync; otherwise it becomes obsolete. You need real collaboration.
Data dictionaries, glossaries and policies can’t live in different formats and in different places. It is common for these to be expressed in different ways, depending on the database and underlying storage technology, but this causes policy bloat and rules that no organization, team or employee will understand, let alone realistically manage.
Effective data governance requires that business glossaries, data dictionaries and data privacy policies live in one central location, so they can be easily tracked, monitored and updated over time.
Successful data migrations and upgrades rely on seamless integration of tools and processes with the coordinated efforts of people and resources. A passive approach frequently relies on creating new copies of data, usually with sensitive identifiers removed or obscured.
Not only does this passive approach create inefficiencies in determining what data to copy, how to copy it and where to store the copy, it also creates new volumes of data that become harder to track over time. Again, a passive approach to data cannot scale. Direct access to the same live data across the organization is required.
Metadata management and manual mapping are a challenge to most organizations. Data comes in all shapes, sizes and formats, and there is no way to know what type of data a project will need – or even where that data will sit.
Some data might be in the cloud, some on premises, and sometimes projects will require a hybrid approach. All data must be governed, regardless of where it is located.
Privacy and compliance personnel know the rules that must be applied to data, but may not necessarily know the technology. Automated data governance therefore requires that anyone, at any level of expertise, can understand which rules (e.g., privacy policies) are applied to enterprise data.
Organizations with established data governance must empower both those with technical skill sets and those with privacy and compliance knowledge, so all teams can play a meaningful role controlling how data is used.