Tom Haughey

Database Refactoring

By Tom Haughey on December 1, 2010
View Full Bio →

Refactoring is a technique for restructuring an existing body of code by altering its internal structure without changing its external behavior. Advocates of Agile also advocate the practice of database refactoring. A database refactoring is an application of this concept to databases. Database refactoring represents a simple change to a database schema that improves its design while retaining both its behavioral and informational semantics. Database refactoring is more difficult than code refactoring. Code refactorings only need to maintain behavior. Database refactorings also must maintain existing content, and integrity and other business rules. The term “database” includes all database objects, such as tables and columns, and logic objects such as stored procedures and triggers. 

Agile subdivides database refactorings into several types. The different types with an example are: structural (change the schema), data quality (add a domain look-up), referential integrity (alter a column to non-nullable foreign key), architectural (make an algorithm a stored procedure), procedural (change a derivation). Transformations add new capabilities. A simple example of a structural refactoring is to take a column, such as Tax Identifier, and split into two columns, Federal Tax Id and Social Security Number. The external behavior of the application remains unchanged.

Agile proponents have written quite a bit about database refactoring in articles and books. Individual refactorings are broken down into fine detail, such as refactoring a foreign key.

Keep in mind that refactoring represents improvements to an existing design, not enhancements to requirements or major maintenance. So in refactoring, you are not adding or removing functionalities. Agile proponents, such as Scott Ambler, advocate the delivery of databases in small increments, and he in particular proposes that the logical data model is virtually useless. (See my earlier blogs). I believe the practice of breaking the database into small (say, 2-week) increments has many inherent risks and hidden costs.  We recommend having a well-defined data model in advance. This data model should not be huge but it should span multiple increments, including some future increments. It is possible to do this without doing a BDUF (big design up front). The issue is not in dividing work into increments; it is in the size of the increments. Two-week increments are just not productive. We believe that the total cost of ownership is greater when such small increments are used.

Refactoring the data warehouse poses some unique considerations. The problem is a little simpler in OLTP databases. DWs have large volumes of data, often with a large number of columns. I believe that repeated refactoring, sometimes even occasional refactoring, in a DW has hidden costs. Let us say that you are a soft drink company. You have two columns, Product Sold Quantity and Product Size. Product Sold Quantity tells you the quantity of a given product sold (such as a diet soda). Product Size tells you the bottle size (such as 20-ounce). You have data coming in from all different sources and countries. You really need a common unit of measure to report on volume sold. You decide that the answer is to use an 8-Ounce Equivalent Volume. Using an algorithm, you can calculate it from the other two. But this data element is used very frequently so you decide you need to store it and are going to refactor to add it. You knew about this before, but after all you had only two weeks. Better not to have dealt with it then. You have 1-billion rows in each of two fact tables. Altering the tables and writing the refactoring code is quite simple. However, since your main tables each have 1-billion rows, reloading such tables could take a great deal of time, even days - depending on your technology. In a mission-critical DW this is unacceptable.

Agile proponents also propose refactoring smells. These are tangible indications that the code or a database needs improvement. Smells are inefficiencies in the software. Database examples are hybrid columns, multi-use columns or tables, and redundant data.

This blog we defined refactoring and described some of its characteristics. In the next blog we will examine whether data modeling truly requires BDUF. Is there a middle ground between big project scopes and the two-week increments of Agile? Is there a better balance in making such choices?

Follow all Expert Blog updates by subscribing to the RSS RSS feed.

About the Author

Tom Haughey is considered one of the four founding fathers of Information Engineering in America. He is currently President of InfoModel, LLC, training and consulting company. His courses on data management, data warehousing, and software development have been delivered to Fortune 100 companies around the world.

There have been no comments yet.

Name:

Email:

Comment:

What is missing: North, South, East?

Notify me of follow-up comments?