A data catalog benefits organizations in a myriad of ways. With the right data catalog tool, organizations can automate enterprise metadata management – including data cataloging, data mapping, data quality and code generation for faster time to value and greater accuracy for data movement and/or deployment projects.
Data cataloging helps curate internal and external datasets for a range of content authors. Gartner says this doubles business benefits and ensures effective management and monetization of data assets in the long-term if linked to broader data governance, data quality and metadata management initiatives.
But even with this in mind, the importance of data cataloging is growing. In the regulated data world (GDPR, HIPPA etc) organizations need to have a good understanding of their data lineage – and the data catalog benefits to data lineage are substantial.
Data lineage is a core operational business component of data governance technology architecture, encompassing the processes and technology to provide full-spectrum visibility into the ways data flows across an enterprise.
There are a number of different approaches to data lineage. Here, I outline the common approach, and the approach incorporating data cataloging – including the top 5 data catalog benefits for understanding your organization’s data lineage.
The most common approach for assembling a collection of data lineage mappings traces data flows in a reverse manner. The process begins with the target or data end point, and then traversing the processes, applications, and ETL tasks in reverse from the target.
For example, to determine the mappings for the data pipelines populating a data warehouse, a data lineage tool might begin with the data warehouse and examine the ETL tasks that immediately proceed the loading of the data into the target warehouse.
The data sources that feed the ETL process are added to a “task list,” and the process is repeated for each of those sources. At each stage, the discovered pieces of lineage are documented. At the end of the sequence, the process will have reverse-mapped the pipelines for populating that warehouse.
While this approach does produce a collection of data lineage maps for selected target systems, there are some drawbacks. First, this approach focuses only on assembling the data pipelines populating the selected target system but does not necessarily provide a comprehensive view of all the information flows and how they interact.
Second, this process produces the information that can be used for a static view of the data pipelines, but the process needs to be executed on a regular basis to account for changes to the environment or data sources.
Third, and probably most important, this process produces a technical view of the information flow, but it does not necessarily provide any deeper insights into the semantic lineage, or how the data assets map to the corresponding business usage models.
An alternate approach to data lineage combines data discovery and the use of a data catalog that captures data asset metadata with a data mapping framework that documents connections between the data assets.
This data catalog approach also takes advantage of automation, but in a different way: using platform-specific data connectors, the tool scans the environment for storing each data asset and imports data asset metadata into the data catalog.
When data asset structures are similar, the tool can compare data element domains and value sets, and automatically create the data mapping.
In turn, the data catalog approach performs data discovery using the same data connectors to parse the code involved in data movement, such as major ETL environments and procedural code – basically any executable task that moves data.
The information collected through this process is reverse engineered to create mappings from source data sets to target data sets based on what was discovered.
For example, you can map the databases used for transaction processing, determine that subsets of the transaction processing database are extracted and moved to a staging area, and then parse the ETL code to infer the mappings.
These direct mappings also are documented in the data catalog. In cases where the mappings are not obvious, a tool can help a data steward manually map data assets into the catalog.
The result is a data catalog that incorporates the structural and semantic metadata associated with each data asset as well as the direct mappings for how that data set is populated.
And this is a powerful representative paradigm – instead of capturing a static view of specific data pipelines, it allows a data consumer to request a dynamically-assembled lineage from the documented mappings.
By interrogating the catalog, the current view of any specific data lineage can be rendered on the fly that shows all points of the data lineage: the origination points, the processing stages, the sequences of transformations, and the final destination.
Materializing the “current active lineage” dynamically reduces the risk of having an older version of the lineage that is no longer relevant or correct. When new information is added to the data catalog (such as a newly-added data source of a modification to the ETL code), dynamically-generated views of the lineage will be kept up-to-date automatically.
A data catalog benefits data lineage in the following five distinct ways:
One can impose additional data governance controls with project management oversight, which allows you to designate data lineage mappings in terms of the project life cycle (such as development, test or production).
Aside from these data catalog benefits, this approach allows you to reduce the amount of manual effort for accumulating the information for data lineage and continually reviewing the data landscape to maintain consistency, thus providing a greater return on investment for your data intelligence budget.