Tom Haughey

Recursion in Data Modeling - Part 2

By Tom Haughey on May 16, 2011
View Full Bio →

In the previous blog, we discussed the concept of recursion. In this blog we talk about ways of implementing recursion in a physical database design. There are at least seven ways to implement recursion:

  1. Implement the recursion "as is" (sometimes called the adjacency list)
  2. Implement the recursion with separate fixed levels
  3. Flatten the recursion (as a series of columns in one table)
  4. Populate the recursion as descendent tables
  5. Collapse the recursion into a single self-referencing entity
  6. Introduce explicit sequencing
  7. Implement the recursion as nested sets

But remember, each way has its pros and cons. No one solution fits all situations. Test your choice before you build it. Each solution is a trade-off.

1. Implementing Recursion "As Is"
In this solution, recursion is implemented as in the logical model, shown in Figure 1. Until recently, this was difficult to do in a relational DBMS because SQL alone was not able to support it. New DBMS features now support this. While the slowest performing solution we tested, the new DBMS features allow it to perform acceptably. This implementation is sometimes called the adjacency list.

Figure 1

In most of cases, it is useful (and in some SQLs necessary) to add some helper fields, such as the following:
Maximum number of levels.
An indicator for the top or bottom of the hierarchy.
A sequence number, preferably universal (which runs sequentially throughout the entire hierarchy).

2. Implementing Recursion with Fixed Levels

Figure 2

In this solution, recursion is eliminated. Instead, the different levels of a hierarchy are implemented as separate tables. This solution is useful only where the levels are stable and few. It is also useful if the combined number of attributes from the levels, such as Product, Subgroup and Group above, are many. Narrow tables perform better than wide tables.

This solution will usually perform better than recursion. Its disadvantage is that it is rigid. What happens if the hierarchy changes? What happens if new levels are needed? It is more flexible than flattening the hierarchy but less flexible than using a “bill of materials” (BOM) recursion. It involves fewer batch updates than flattening the hierarchy because it minimizes redundancy.

In dimensional modeling, this is called "snowflaking" a dimension. Snowflaking is not as dreaded as some would make you believe. A snowflaked dimension, as in Figure 2, is preferable under the following conditions:

  • A dimension is wide, such as a household dimension with many attributes;
  • A dimension should be subtyped because it has both generalized and specialized attributes, such as a product dimension with some common attributes for all products but also some attributes that differ by product;
  • Where some attributes occur multiple times within a dimension, such as a calendar dimension with common attributes but also country-specific attributes;
  • Finally where the relationships across dimensions are many-to-many, such as the relationship between customer and account data in a financial business. A customer can have many accounts and an account can belong to many customers. The snowflake consists of three tables in this case: customer, customer-account, account. The associative table, customer-account, is often called a bridge table. Some facts might relate to customer, others to account, and still others to customer-account.

3. Flattening the Hierarchy (As a Series of Columns in One Table)

Figure 3

Here also recursion is eliminated but the separate levels are collapsed into a single row. (We show them here on the left as a fixed hierarchy rather than a recursion only to clarify the levels in the example.) Each column (or group of columns) in the row represents a given level. In this example, each combined set of code and description represents a level. In a flattened hierarchy, the levels are still fixed but strung out in a single row one after the other. This is how a typical “star schema” dimension handles it.

Its advantages are that there are fewer joins and better performance than recursion or fixed levels - as long as the flattened table is not very wide. If the dimension is very wide, fixed levels are better.

Its disadvantages include:

  • Redundancy, because the higher levels are always duplicated at each lower level. When the hierarchy changes, the changes have to be applied to all the duplicated data.
  • If the table has many attributes and levels, it is harder to understand what the levels are. In addition, performance can be compromised.
  • The cost of additional storage on large tables with a lot of redundancy.

Consider the effect of a ragged hierarchy on this solution. A ragged hierarchy is one in which a member at any level can have different levels of children below it or parents above it. In other words, it is irregular. Typically a product BOM or an organizational structure are ragged hierarchies. One product goes to one depth, another product to a different depth. One org breaks down five levels, another breaks down seven levels. Since this solution reduces an irregular hierarchy to a regular one, it must allow for the maximum number of levels. When a ragged hierarchy is reduced in this way, some levels may be skipped. That’s why it is ragged. When a level is skipped, there are two solutions to signal the skipped level: you can repeat data for the next higher level in the missing (lower) levels; or (if the DBMS supports this) use a skip flag to indicate the missing level.

4. Populating Descendent Tables

Figure 4

A network recursion, being an adjacency list, typically has entries from any parent to its immediate children. The entire recursion is navigated by going from parent to immediate children, then using the children as parents to get their immediate children, and so forth. Navigating the entire hierarchy requires navigating multiple parent-child relationships. Descendent tables use the same structure as a typical network recursion. The difference is in how it is populated. Descendent tables have instances from each parent to all of its children - immediate, intermediate and ultimate. In other words, descendent tables populate paths from every parent to every child of that parent.

This solution is sometimes called helper tables, bridge tables or speed tables. These names are too generic and do not get to the real nature of this solution. In sandlot baseball we used to say, “I call them the way I see them”. It is better to call them “descendent tables” because it is more correctly descriptive.

5. Implementing Self-Referencing Structures

Figure 5

In this example, Organization is a self-referencing table. The relationship from parent to child organization is a true hierarchy, not a network. SQL has long supported SELF-JOINS and can support it through correlated sub-queries.

Yet, in some cases, it can be beneficial similarly to collapse a network recursion into a single self-referencing table. Remember in a network recursion, the relationship between primary entities, such as organization in this example, is many-to-many. If this solution is used for a network recursion, the Parent Org data should be included in every row. This would introduce redundancy but would eliminate the network recursion, and the bouncing from primary table to associative table in the BOM. It may also be necessary to increase the granularity of Organization so that the Org can belong to different parents in different hierarchies. This can be done by introducing a hierarchy table as in Figure 1 and including the hierarchy key as part of the Org key. If the structure is ragged and very variable, this may be impractical and artificial.

6. Introducing Explicit Sequencing
Any standard recursion can be accelerated by including a sequence number as an extra attribute. The sequence number indicates exactly that: the order of the entries. The sequence number allows all entries within a single parent to be retrieved in a certain order by using only the primary key and the sequence number. This is a common practice in manufacturing BOMs.

7. Implementing Recursion as Nested Sets
This is also called a modified preorder tree traversal. Let us first explain this. We will use a simple, well-know food example as defined by Gijs Van Tulder. Here are his general steps:

  • Lay out the tree in a horizontal way.
  • Start at the root node ('Food'), and write a 1 to its left. Follow the tree to 'Fruit' and write a 2 next to it.
  • In this way, you walk (traverse) along the edges of the tree while writing a number on the left and right side of each node. Call these numbers LFT and RGT (e.g. the left value of 'Food' is 1, the right value is 18). These numbers indicate the sequential relationship between nodes.
  • The last number is written at the right side of the 'Food' node.

Figure 6 (example from Gijs Van Tulder)

Once understood, it is clear that you can do almost everything with this technique that you could do with the standard hierarchy recursion (the adjacency list). It is very fast, and it can be done in basic SQL. Updating the tree takes more queries and is slower, but retrieving the nodes is achieved with only one query and is very fast indeed.

There are three disadvantages to this solution:

  • It is difficult to understand;
  • When changes happen, the entire hierarchy must be rebuilt, which can be very time consuming. For a large data warehouse, with many daily organizational or product changes daily, maintenance of hierarchies built this way could be prohibitive.
  • It supports only hierarchy recursion, not network.

Summary
No one solution to recursion is ideal in all cases. Each one has its pros and cons. Each is a compromise. The answer is to find the best compromise for your particular solution assessing the advantages and disadvantages of each one.

Follow all Expert Blog updates by subscribing to the RSS RSS feed.

About the Author

Tom Haughey is considered one of the four founding fathers of Information Engineering in America. He is currently President of InfoModel, LLC, training and consulting company. His courses on data management, data warehousing, and software development have been delivered to Fortune 100 companies around the world.

There have been no comments yet.

Name:

Email:

Comment:

An … a day keeps the doctor away. What word is missing?

Notify me of follow-up comments?