Alec Sharp

Shortcuts and Loops – Dealing With Transitivity in Data Models

By Alec Sharp on May 20, 2011
View Full Bio →

You never know what topics people will find interesting. My last couple of posts looked at making Fourth and Fifth Normal Form understandable to the average data modeler and other mere mortals. Even to a data modeling fanboy like me, a 4NF/5NF discussion sounded like a real snoozer, but evidently it wasn’t (at least to some of you – you know who you are.) Via Twitter and email I received a number of favorable comments. One correspondent, the inimitable Mike Gorman (http://www.wiscorp.com/), went a bit further and spotted a potential issue in one of my sample data models that I had only obliquely noted. I’d intended to use it as the topic of a future post, but now that Mike has called it out, it seems an appropriate topic for this month’s post.

What Mike spotted was some potentially transitive relationships. This struck me as a good topic for a post because it’s one of the three most common structural problems I see in models developed by inexperienced modelers. I knew you’d ask… the other two are:

  1. Lots of entities that don’t represent true business entities – they represent forms, reports, views, or other artifacts of the business system. This clutters up the model unnecessarily, introduces the usual data integrity problems related to unnormalized (as opposed to denormalized) data, and ties the model to a point in time.
  2. A tendency to bolt on new entities (“spare parts”) whenever a new data need arises, rather than integrating that need into the existing structures. I call this the “Frankenmodel” phenomenon, referring to Mary Shelley’s fictional monster, bolted together from “spare parts.” If you want to see it for yourself, look at any database that’s been in production for several years in an unmanaged environment.

What transitivity and these other two problems have in common is excess – excess entities, excess relationships, and excess attributes. That leads us into the definition of transitivity, which could be described as an excess relationship.

Transitivity – potential and actual

A relationship in an Entity-Relationship Diagram is said to be transitive if it is unnecessary because the same information is available by traversing more fundamental relationships. In other words, it’s a redundant relationship. As with redundant attributes, they should be avoided during data modeling because they confuse things by taking focus from the essence, and because premature design decisions have an insidious way of compounding themselves. Your database designer or DBA may choose to add them – judiciously – during physical database design. I say judiciously because redundancy in operational data always introduces the risk of data integrity problems.

Like many of our terms, “transitivity” originated in mathematics, so we’ll use their definition as the starting point in illustrating transitivity:
“In mathematics, a binary relation ... is transitive if whenever an element a is related to an element b, and b is in turn related to an element c, then a is also related to c.”
(Thank you, Wikipedia – I’m writing this in Mumbai, and I forgot to bring my old math texts with me.) Examples include:

  • whenever A > B and B > C, then also A > C
  • whenever A ≥ B and B ≥ C, then also A ≥ C
  • whenever A = B and B = C, then also A = C

The parallels with a data model are clear. Relationships in an ERD are binary (they relate exactly two entities) and if Entity A is related to Entity B, and Entity B is related to Entity C, then Entity A is, no doubt about it, related to Entity C via Entity B. But where’s the transitive (or potentially transitive) relationship, the one that introduces the demon transitivity? Well, it’s not there yet, so let’s add a relationship from Entity C to Entity A, as illustrated below. That’s the culprit – a potentially transitive relationship. Let’s also add an example that’s more interesting than A, B, and C.

Now that we have a visual, we can see the usual clue to transitivity – relationships that form a “loop.” In the example, “A to B to C and back to A” is a loop. Another way to express it is that there are two paths between a given pair of entities, which we see in the example. We have two paths between C and A – one directly, and one via B. The relationships in the Customer – Facility – Order example show the same thing. The relationship we added, “C is related to A,” is potentially a transitive (unnecessary) relationship, as is “Order placed by Customer.” However, we don’t know yet if either example really has a transitive relationship – just like an item found at a crime scene, the loop is only a clue, not proof.

Investigating potential transitivity

To prove the crime of transitivity, the basic approach is to inspect each relationship in the loop, one at a time, and determine if it could be eliminated without the loss of any information. (I was told years ago this is called “information loss analysis,” but I really don’t know if it’s the right term.) If a relationship could be eliminated, then it is transitive (redundant) and should be dropped from the model. As usual, we can’t make that determination unless we know the meaning of the relationships. Let’s try it with the example in which Customers place Orders for delivery to one of their Facilities:

  • Facility “operated by” Customer is necessary information in this business – assume that otherwise, we wouldn’t know whether a given Customer is allowed to place Orders for a given Facility. That relationship is essential, and can’t be eliminated;
  • Similarly, Order “for delivery to” Facility is essential and must be maintained, otherwise we wouldn’t be able to record which of a Customer’s many Facilities an Order is for. Sprinkling bit of the Order around to each of the Customer’s various Facilities is not an acceptable option;
  • That leaves Order “placed by” Customer, which indeed can be eliminated because we can infer it from the other two. An Order is “for delivery to” exactly one Facility, and that Facility is “operated by” exactly one Customer. According to our friends the mathematicians, we therefore already know that Order is related to Customer, therefore Order “placed by” Customer adds no new information, just a shortcut. It is guilty of the crime of transitivity and is sentenced to elimination from the model.

Here’s the corrected version:

Sometimes I call an example like “Order placed by Customer” a “process relationship” because in the business process, it’s true that a Customer places Orders. However, that’s different than the more fundamental “data relationships” we must maintain in the interest of data integrity. The obvious problem with uncontrolled redundancy is that if we left that relationship in place and it was included in the physical database design, we could end up with cases in which the two paths from Order to Customer lead to different Customers. But what if it was normal for those paths to lead to different customers…?

“You can’t tell just by looking…”

In a different business situation, the “Order placed by Customer” relationship wouldn’t be transitive, it would be absolutely essential. If a Customer could place Orders for delivery to another Customer’s Facility, a plausible requirement, then this additional path would be necessary. One way to look at this is that the two paths from Order to Customer carry different information – one leads to the Customer that operates the delivery Facility, and the other leads to the Customer that placed the Order. Experienced data modelers are more likely to use this “paths contain different information” approach to confirm transitivity rather than go through all the steps of information loss analysis. That’s especially true when the alternate paths are long, traversing many relationships.

In the A-B-C example, we just can’t tell whether or not there is a transitive relationship because the “is related to” relationship name gives no clue whatever to the meaning of the relationships. That’s one more good reason to use descriptive relationship names. I once reviewed an 800+ entity data model in which every single relationship had the same name – “is related to.” I wanted to say something sarcastic like “Gee, Sparky, that’s what the line means – the name is supposed to tell us how they’re related.”

Next time

Hmm… I see that I’ve hit my target for the length of this post without ever getting around to the situation Mike Gorman spotted. Typical, but I guess it means I have a guaranteed topic for next time around. We’ll look at more situations in which loops in a data model offer a clue to transitivity, including some with much longer paths than our simple model. However, while these loops won’t indicate transitivity, they will highlight the need for crucial business rules the vigilant data modeler (e.g., Mr. Gorman) must always be on the lookout for.

Take care!

Follow all Expert Blog updates by subscribing to the RSS RSS feed.

About the Author

Alec Sharp has managed his consulting and education business, Clariteq Systems Consulting Ltd., for close to 30 years. Serving clients from Ireland to India, and Washington to Wellington, Alec has expertise in a rare combination of fields - data management, business analysis, business process improvement, and enterprise architecture.

David Jaques-Watson
July 4, 2011

You may like to add: if a Customer can place an Order for another Customer, and if both Customer keys end up in the Order, then the recommendation is to rename the attributes within both relationships. You end up with “Ordering Customer” and “Receiving Customer”.

The trap for young players is that ERwin will not do this for you! (Or even warn you of such cases.) It is quite happy to link two relationships to a single child attribute. You have to be aware of and watch out for these situations.

BTW, with the “is related to” issue, I keep a list of good relationship phrases that I can re-use (or alter); e.g “classifies / is classified by”, “identifies the status of”, etc. However, I find it much harder to create good phrases than good entity names. Do you have a similar list?

Tim Hosking
July 4, 2011

Hi Alec,
Good article. The point you make is important when talking about database design, however for a model that is trying to capture the semantics of the business, then I would argue that the transitive relationship may be quite important as an expression of a business rule and in that case should not be dropped - even if it is not implemented in a database. The concept that “an Order is placed by a Customer” is important even if it was not true that “a Customer could place Orders for delivery to another Customer’s Facility”.

I follow the practice of keeping those relationships in my “conceptual” or semantic model,while dropping them from my “logical” or database design model.

Alan Gredell
August 30, 2011

Alec, I’ve been getting a great deal of benefit from your blog; has CA cut you off?  Please, continue publishing, I just found it and have read every article so far, and am looking forward to more!!!

Thanks!

Name:

Email:

Comment:

An … a day keeps the doctor away. What word is missing?

Notify me of follow-up comments?