The Data Vault Method for Modeling the Data Warehouse
The Data Vault method for modeling the data warehouse was born of necessity. Data warehouse projects classically contend with long implementation times, so business requirements are likely to change in the course of the project, putting both the target schedule and the budget at risk.
To shorten implementation times, Dan Linstedt introduced the Data Vault method for modeling the core warehouse. Its key design principle is to separate an entity's business key, descriptive context, and relationships into distinct table types: hubs, satellites, and links.
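As a minimal illustration of this separation (all table and column names here are hypothetical, not prescribed by the method), a Customer entity with related orders might decompose as follows:

```python
# Hypothetical sketch of the Data Vault separation: the business key,
# its descriptive context, and its relationships each get a table of
# their own. Column names are illustrative assumptions.

hub_customer = {  # hub: the business key only
    "table": "HUB_CUSTOMER",
    "columns": [
        "CUSTOMER_HK",    # surrogate hash key
        "CUSTOMER_BK",    # business key from the source system
        "LOAD_DTS",       # load timestamp (meta-information)
        "RECORD_SOURCE",  # originating system (meta-information)
    ],
}

sat_customer = {  # satellite: descriptive context, historized over time
    "table": "SAT_CUSTOMER",
    "columns": [
        "CUSTOMER_HK", "LOAD_DTS",
        "NAME", "ADDRESS",  # the actual descriptive attributes
        "HASH_DIFF",        # change-detection hash over the attributes
        "RECORD_SOURCE",
    ],
}

link_customer_order = {  # link: a relationship between business keys
    "table": "LNK_CUSTOMER_ORDER",
    "columns": [
        "CUSTOMER_ORDER_HK",
        "CUSTOMER_HK", "ORDER_HK",  # keys of the connected hubs
        "LOAD_DTS", "RECORD_SOURCE",
    ],
}
```

The highly regular structure visible here (every hub, satellite, and link follows the same column pattern) is exactly what makes the method amenable to generation later in this article.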
Because of the many benefits it offers, Data Vault is currently the established standard for modeling the core data warehouse. These benefits include the following:
Data Vault Benefits
• Easy extensibility enables an agile project approach
• The models created are highly scalable
• The loading processes can be optimally parallelized because there are few synchronization points
• The models are easy to audit
But alongside the many benefits, Data Vault projects also present a number of challenges. These include, but are not limited to, the following:
Data Vault Drawbacks
• A vast increase in the number of data objects (tables, columns), because the information types are separated and enriched with meta-information for loading
• Correspondingly greater modeling effort, much of it consisting of simple, repetitive mechanical tasks
How can these challenges be mastered using a standard data modeling tool?
The highly schematic structure of the models offers ideal prerequisites for generating models. This allows sizable parts of the modeling process to be automated, enabling Data Vault projects to be accelerated dramatically.
Potential for Automating Data Vault
Which specific parts of the model can be automated?
The standard architecture of a data warehouse includes the following layers:
- Source system: Operational system, such as ERP or CRM systems
- Staging area: This is where the data is delivered from the operational systems. The structure of the data model generally corresponds to the source system, with enhancements for documenting loading.
- Core warehouse: The data from the various source systems is integrated here. This layer is modeled in accordance with Data Vault and is subdivided into a raw vault and a business vault: all business rules are implemented in the business vault, so the raw vault needs only very simple transformations.
- Data marts: The structure of the data marts is based on the analysis requirements and is modeled as a star schema.
Both the staging area and the raw vault are very well suited for automation, as clearly defined derivation rules can be established from the preceding layer.
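One such derivation rule is the step from a source table to its staging counterpart: the structure mirrors the source and is enriched with meta-columns documenting the load. A minimal sketch of that rule (the meta-column names and the `STG_` prefix are assumptions, not a fixed convention):

```python
# Hypothetical sketch: deriving a staging table from a source table.
# The staging structure corresponds to the source system's structure,
# enriched with meta-columns that document the load.

META_COLUMNS = ["LOAD_DTS", "RECORD_SOURCE", "LOAD_ID"]  # assumed names


def derive_staging_table(source_table: str, source_columns: list) -> dict:
    """Apply the derivation rule: source layer -> staging layer."""
    return {
        "table": f"STG_{source_table}",                # assumed prefix
        "columns": list(source_columns) + META_COLUMNS,
    }


staging = derive_staging_table("CUSTOMER", ["CUSTOMER_ID", "NAME", "ADDRESS"])
print(staging["table"])  # STG_CUSTOMER
```

Because the rule is purely mechanical, it can be applied to every table of the preceding layer without human judgment; that is what makes these two layers such good automation candidates.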
Should automation be implemented using a standard modeling tool or using a specialized data warehouse automation tool?
Automation potential can generally be leveraged using special automation tools.
What are the arguments in favor of using a standard tool such as the erwin Data Modeler?
Using a standard modeling tool offers many benefits:
- The erwin Data Modeler environment generally already contains models (for example, of the source systems), which can continue to be used
- The modeling functions are highly sophisticated – for example, for comparing models and for standardization within models
- A wide range of databases are supported as standard
- A large number of interfaces are available for importing models from other tools
- Often the tool has already been used to model source systems or other warehouses
- The model range can be used to model the entire enterprise architecture, not only the data warehouse (erwin Web Portal)
- Business glossaries enable (existing) semantic information to be integrated
So far so good. But can the erwin Data Modeler generate models?
A special add-in for the erwin Data Modeler has been developed specifically to meet this requirement: MODGEN. This enables the potential for automation in erwin to be exploited to the full.
It integrates seamlessly into the erwin user interface, and its operation is closely based on erwin's model comparison (Complete Compare).
The following specific functionalities are implemented in MODGEN:
- Generation of staging and raw vault models based on the model of the preceding layer
- Generation is controlled by enriching the particular preceding model with meta-information, which is stored in user-defined properties (UDPs)
- Individual objects can be excluded from the generation process, either permanently or temporarily
- Specifications for meta-columns can be integrated very easily using templates
To support a modeling process that is repeated many times, with models created and enhanced iteratively, generation must be round-trip capable.
To achieve this, generation always performs a comparison between the source and target models and highlights any differences, which the user can select and apply during generation.
Generation naturally covers all tables and columns (horizontal modeling), but it also creates vertical model information: every generated target column is linked to the source column it is derived from, documenting its data source. Source-to-target mappings can therefore be generated very easily from the model.
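To make the idea of vertical model information concrete, here is a small sketch of how column-level lineage recorded in a model translates directly into a source-to-target mapping. The data structure and names are assumptions for illustration, not MODGEN's actual internals:

```python
# Hypothetical sketch: each generated target column carries a reference
# to its source column, so a source-to-target mapping is simply a
# projection of that lineage information.

lineage = [
    # (source table, source column, target table, target column)
    ("CUSTOMER", "CUSTOMER_ID", "HUB_CUSTOMER", "CUSTOMER_BK"),
    ("CUSTOMER", "NAME",        "SAT_CUSTOMER", "NAME"),
    ("CUSTOMER", "ADDRESS",     "SAT_CUSTOMER", "ADDRESS"),
]


def source_to_target_mapping(rows):
    """Render the recorded lineage as a readable mapping document."""
    return [f"{st}.{sc} -> {tt}.{tc}" for st, sc, tt, tc in rows]


for line in source_to_target_mapping(lineage):
    print(line)  # e.g. CUSTOMER.CUSTOMER_ID -> HUB_CUSTOMER.CUSTOMER_BK
```

The same recorded relationships are what an impact and lineage analysis traverses, which is why publishing the models to a web portal makes that functionality available without extra work.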
Integrating the source and target model into a web portal automatically makes the full impact and lineage analysis functionality available.
If you are interested in finding out more, or if you would like to experience MODGEN live, please contact our partner heureka.
Author details: Stefan Kausch, heureka e-Business GmbH
Stefan Kausch is the CEO and founder of heureka e-Business GmbH, a company focused on IT consultancy and software development.
Stefan has more than 15 years’ experience as a consultant, trainer, and educator and has developed and delivered data modeling processes and data governance initiatives for many different companies.
He has successfully executed many projects for customers, primarily developing application systems, data warehouse automation solutions and ETL processes. Stefan Kausch has in-depth knowledge of application development based on data models.
heureka e-Business GmbH
Untere Burghalde 69
Tel.: 0049 7152 939310