Data Integration Maturity Factors, non-Googlephonic
By William McKnight on October 5, 2010
Listening to some people talk about their data integration environments reminds me of the old Steve Martin bit about his speaker system.
“So, I traded that in and got the googlephonic, which is the highest number of speakers you can have before infinity. Sounds like s&*%! So, then I said, “Hey, maybe it’s the needle!” I had the typical diamond needle. I searched around got the moonrock needle, cost me 3 million bucks, but what the hey. So, now I have a googlephonic stereo with a moonrock needle. It’s okay for a car stereo, I wouldn’t want it in my house.”
Data integration is important to data modeling. In analytical systems, it’s how the model is usually populated. In operational systems, it organizes how the data is collected. In some environments a modeler wears the data integration hat as well. Regardless, the two activities are about as intertwined as any two activities can be.
Many factors can be used to evaluate the maturity of data integration within an organization, and they go well beyond its MPP-ness, its ability to run 128 parallel streams, and other googlephonic measures.
How do you know if you’ve done a good job at data integration tool selection? Often, you’re choosing between two or three high-powered, high-budget tools that are known to be able to do the job you’re going to ask of them. Perhaps you chose Informatica. Great job! What brilliance!
However, after the tool is chosen, the implementation will not lend itself so easily to brilliance. Many factors will come into play that are ultimately related to how well data integration keeps those models populated with fresh data.
I’m going to list and explore some of these factors here. Whatever you are using to judge the maturity and value of your data integration system, make sure these factors are included and not just the tool’s googlephonicness.
- Data currency is as real-time as possible – it is not hindered by limitations of the data integration (although the source system may prove challenging)
- The data integration cycle can be run at any time (subject only to the performance impact on source and data warehouse systems)
- Data integration is scheduled, not manually run
- Data integration is up-to-date – it is not presently or persistently “behind” schedule
- Data integration is non-intrusive to the source environment – it does not inhibit any function of the source environment
- Data integration is non-intrusive to the target environment – it does not inhibit any function of the target environment, like causing it to be inaccessible when users would like to use it
- Data integration sources changed-only data – it only picks up data that it has not yet seen, no “full refreshes”
- Data integration is well-performing – it runs within the allotted window, and is not pushing the limits and creating concern; in the event of load failure, there is time for recovery and reload.
- Data integration is consistent, running in “lights out, hands off” mode – no manual intervention is planned
- Data integration code is maintained by version control
- A code shell is maintained that provides consistent error trapping and code organization from developer to developer
- Data integration coding standards are maintained and conformed with – naming structures, code libraries, change control, code standards
- Data integration code is extensively commented
- Code reviews are performed on a regular basis and are part of scheduled development efforts
- Load validation is performed as an integral part of data integration, not as an afterthought – both row counts and relevant balances/amounts are reconciled to source and other references after each refresh
- All data integration code is documented in a consistent format, with the documentation centrally located, consistent, accessible, and complete
- Triage or over-sourcing is performed in source system extracts – more is sourced than is immediately necessary
- Source data calculations are verified before sourcing – source calculations should not be inherently trusted
- The fields underlying the calculations are sourced as well as the calculations themselves
- Source-target mapping occurs for all data integration work
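Two of the factors above, sourcing changed-only data and validating each load, are easier to picture with a small example. What follows is only a rough sketch in Python, not any particular tool's method. The function names (extract_changed, validate_load) and the field names (updated_at, amount) are hypothetical, and it assumes the source rows carry a reliable last-modified timestamp, which many source systems do not.

```python
from decimal import Decimal


def extract_changed(source_rows, last_watermark):
    """Pick up only rows modified since the previous run (no full refresh).

    Assumes each row has a hypothetical 'updated_at' field holding an
    ISO-8601 date string, so plain string comparison orders correctly.
    """
    changed = [r for r in source_rows if r["updated_at"] > last_watermark]
    # Carry the high-water mark forward; keep the old one if nothing changed.
    new_watermark = max((r["updated_at"] for r in changed), default=last_watermark)
    return changed, new_watermark


def validate_load(extracted, loaded, amount_field="amount"):
    """Reconcile row counts and a control total after a refresh.

    Returns a list of problem descriptions; an empty list means the
    load reconciles. 'amount_field' is a hypothetical balance column.
    """
    problems = []
    if len(extracted) != len(loaded):
        problems.append(
            f"row count mismatch: {len(extracted)} extracted, {len(loaded)} loaded"
        )
    # Decimal avoids float rounding noise when comparing monetary totals.
    src_total = sum((Decimal(r[amount_field]) for r in extracted), Decimal("0"))
    tgt_total = sum((Decimal(r[amount_field]) for r in loaded), Decimal("0"))
    if src_total != tgt_total:
        problems.append(
            f"{amount_field} total mismatch: {src_total} in source, {tgt_total} in target"
        )
    return problems
```

In a scheduled job, the watermark returned by extract_changed would be persisted for the next cycle, and a non-empty result from validate_load would fail the run while there is still time in the window to recover and reload.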
Modelers will need to see their models populated in order to create any benefit to the organization. As a result, many take an active, rather than passive, approach to data integration. Being somewhat removed from the technical data integration factors can be a benefit and not a detriment. At the end of the day, it’s these factors in the data integration layer that make the model come alive.
I welcome your comments on any of these factors, as well as your additions.
About the Author
William functions as Strategist, Lead Enterprise Information Architect, and Program Manager for complex, high-volume full life-cycle implementations worldwide utilizing the disciplines of data warehousing, master data management, business intelligence, data quality and operational business intelligence.
I would add ETL must run in parallel and be supported 24 X 7.
October 14, 2010