Alec Sharp

How to win friends and influence people through your mastery of 4/5NF

By Alec Sharp on March 18, 2011
View Full Bio →

Lately this blog has looked at why your E-R diagrams should have a sense of direction (dependency “flowing” from top to bottom) and how much easier that makes many things, including understanding and explaining normalization. This ability will surely make you the life of the party, especially when you can clear up the confusion your friends and neighbors undoubtedly feel about Fourth and Fifth Normal Form! This post will help by backing up the theory we covered last time with some practical examples. As usual, reading the previous few posts will help you make sense of this one.

4NF and 5NF – the essence
In the last post, we saw you can only have a potential violation of 4NF or 5NF if you have an associative entity with three or more parents and are already in 3NF. In other words you have a relationship involving three or more other entities. Using our drawing guidelines, this will be visually apparent as an entity with three or more crowsfeet on the top edge, as illustrated below.  Note – to simplify the wording, for the rest of this post I’ll refer simply to “three-way associatives” or “three parents” but the intent is that you read it as “three-way or higher associatives” or “three or more parents.”

This structure indicates that you might have a violation of 4th or 5th Normal Form, but only an understanding of the business rules will determine if you actually have a violation. This raises an interesting (to data modelers, anyway) difference between the higher and the lower Normal Forms.  With 1st, 2nd, and 3rd Normal Forms you can spot most violations structurally, and don’t need to involve the business experts for confirmation. However, with 4th and 5th  Normal Forms you simply cannot determine if there is a violation without understanding the relevant business rules. Often, it’s even necessary to understand the business process, in order to know the timing with which parts of the relationship come into existence. That’s why I always stress to participants in my Advanced Data Modeling workshops that in these cases you can’t tell whether a model is correct or not simply by inspecting it – you must have business involvement.

That gives rise to three further points:

  1. You must draw the model in a top-down fashion (or other systematic approach) so you can actually see dependencies and potential violations;
  2. You must state your assumptions in narrative form as assertions, using terms (entity names, relationship names, and attribute names) from the data model;
  3. You must illuminate the data model using sample data, schematic diagrams, scenarios, or some other understandable form in addition to the ERD.

A minor wrinkle
Several years ago, at a course I was teaching in New Jersey, one of the participants pointed out that he had never seen a case in which a three-way association was valid. In his experience, they inevitably represented a violation of 4/5NF, and needed to be decomposed into “smaller” associatives. I was surprised by this insight, so over the next while I searched various complex models I’d worked on for three-way associatives, and could find virtually none. Not none, but almost none, so the class participant was essentially correct.

Examining the situations that I’d dealt with in my models, there were many cases in which the business needed to associate three entities. Inevitably, though, two of the entities were associated before the remaining two were associated, because of practicality reasons that were generally related to the timing of the business process.  This means that if you do come across a three-way associative, it’s very likely to be wrong, but you’re not off the hook – you still need to involve the business to understand what the correct solution is.

Variations on a semi-classic example
The standard textbook example of a 4NF violation always involves a three-way association among Employee, Language, and Skill. As we’ll see, this same example can also turn out to have violated 5NF.

The point being made is that the fact that an Employee speaks a Language and the fact that an Employee possesses a Skill are independent, unless you want to stretch the example with skills that are language dependent. (See – you always have to know the business rules.) The appropriate model will have separate Employee-Language and Employee-Skill associatives, showing that the original violated 4NF. It’s not a 5NF violation because there’s no Skill-Language associative that would form a “round-trip.”

The problem with this example is that it always feels a bit artificial – really, you’d have to be asleep on the job to create the original three-way association. Let’s look at another example that also pops up as a “standard example” but has the additional benefit that it’s plausible!

A more realistic example
In this semi-classic example, Agents represent Manufacturers in certain Regions. In my real-life experience, it was Sales Representatives selling Advertising Products in Geographic Market Segments, but we’ll stick with the classic – it uses fewer, shorter words!

The presumption in this model is that any combination of Agent, Manufacturer, and Region is valid. So, if you wanted to assign a Phoenix-based agent to represent a snowblower manufacturer’s products in Brazil, this model will let you. Practically, though, there are some more basic relationships that are established first, independently. For instance, Agents don’t have agreements with every Manufacturer, only a few. And Manufacturers don’t sell into every Region, only those that make sense. With the original model, when an Agent signed an agreement with a Manufacturer, that information couldn’t be recorded if the Agent hadn’t yet been assigned any Regions. Unless, of course, you created a dummy “Unassigned” region, but you’d never do that, would you? Later, some combinations of Agent and Manufacturer would be carried multiple times, redundantly, for every Region the Manufacturer sells into. So, the original model exhibited all the typical problems with structures that don’t conform to 4/5NF – the inability to carry partial relationships, and carrying partial relationships redundantly. That leads to the following model, which (so far) indicates that the original model violated 4NF.

A refinement on a more realistic example
The above looks pretty good until you consider that the implicit assumption is that an Agent that represents a Manufacturer does so in all of the Manufacturer’s Regions. That might have worked when the business was small, but could lead to cases in which Phoenix-based Agents represent Manufacturers who sell into Brazil. In all likelihood, there is some additional rule to prevent Agents from being assigned all over the world, for instance that Agents are assigned to cover only specific Regions. Here’s an updated model representing the addition of this rule:

Now, because we have a “round trip,” the original model turns out to have been a violation of 5NF. But… there’s still a problem, which is that in this model there are some implicit assumptions that most businesses would want to make explicit. The implicit assumption is that when an Agent represents a Manufacturer, and that Manufacturer distributes in a Region, and the Agent covers that Region, then the Agent represents the Manufacturer in that Region. Realistically, though, the actual assignment of Agents will be more fine-grained. And, of course, the whole point of the 4/5NF discussion is to get relationships down to the most fine-grained (or granular) level required by the business. So, let’s see where this example is likely to end up in practice.

A more realistic refinement on a more realistic example
Now we’ll add an additional rule that makes the example behave more like the Sales Representative – Advertising Product – Geographic Market Segment example I mentioned earlier. We still have the constraint that Agents only represent certain Manufacturersand the constraint that Manufacturers only distribute in certain Regions. However, there is an additional constraint – the Manufacturer decides which of their Agents will cover which of their Regions. Now, we have yet another version of the data model.

I’m not sure if this means the original violated 4NF or 5NF (4NF, I think) but as I indicated before, I don’t care – the key point is that the original 3-way associative was “too big,” and we’ve decomposed it into “smaller” associatives that represent the rules of the business and its processes. It’s instructive that we saw three different variations on the original model, each representing a different set of rules, which demonstrates my assertion that you can’t tell whether or not you have a 4/5NF violation unless the business is involved and can explain the rules.

Looking ahead
The astute reader might notice two things:

  1. There are two paths from Agent-Mfr-Region up to Manufacturer, which indicates either that we need to specify additional rules, or we have an example of transitivity.
  2. I’ve left off the keys of the Agent-Mfr-Region entity, because it turns out we have choices, depending on how we handle point 1.

You guessed it – these will be the subject of a future post, but not the next one. 

A closing example – a non-violation
Having made the case earlier that a three-way associative is probably an error, let’s look at one of the rare cases in which it isn’t. Consider the model below, bearing in mind that it is a fragment of a larger model, and I have simplified it greatly so it will fit in here. For instance, Workstations are categorized into Workstation Types, each of which supports multiple Standard Operations, and Manufactured Product is a subtype of Product.

During the manufacturing process, a Lot (a “batch” or “production run”) goes through a sequence of Production Steps (again, a simplification.)

Here’s the relevant assertion – for each Production Step that is required for a Lot, the total required production quantity (Lot Quantity) will be allocated across one or more Workstations, each being assigned a Lot Allocation Quantity.

An example always helps too – for Lot A (which has a Lot Quantity of 500) and Production Step 13, a quantity of 200 units will be allocated to Workstation WS1 and 150 units will be allocated to Workstation WS2. The remaining 150 units have not been allocated yet.

What’s your thought – is this a violation of 4/5NF? It looks like it could be, because Lot Allocation is a 3-way associative. Close inspection, however, reveals that it isn’t. The approach, as always, is to see if any subsets of the relationship come into existence independently of each other. In this case, they don’t:

  • There is no new information requiring an associative between a Lot and a Production Step that must be captured; anything we need to know is already captured via the relationship through Lot to Manufactured Product to Production Step.
  • Similarly, there is no new information about either the relationship between a Production Step and a Workstation, or the relationship between a Lot and a Workstation.

In this case, the three-way relationship represented by Lot Allocation really does come into existence “all at once,” so there’s no 4/5NF violation.

Next time around
Now that you’ve got some examples that will impress your friends and neighbours at the next  “Neighbourhood Data Modeling Block Party,” it’s time to expand your abilities –how to present a data model to people who have no idea what a data model is. Note – that’s well over 99% of the world’s population, so you’ll have a large potential audience. I do a lot of presentations like this, and a frequent response is “That’s the clearest description I’ve ever heard of our business.” That’s the kind of reaction you want, and it isn’t that hard as long as you don’t let “data” get in the way – tune in next time to learn how.

Follow all Expert Blog updates by subscribing to the RSS RSS feed.

About the Author

Alec Sharp has managed his consulting and education business, Clariteq Systems Consulting Ltd., for close to 30 years. Serving clients from Ireland to India, and Washington to Wellington, Alec has expertise in a rare combination of fields - data management, business analysis, business process improvement, and enterprise architecture.

Mike Gorman
March 25, 2011

Alec:

What nags at me about the last diagram is the existance of two diamonds. Lot Allocation up through the left diamond (Lot and Production Step up to Manufactured Product) and the right diamond (Lot and Work station up to Standard Operation).

Isn’t there the possibility of inherent ambiguity via both diamonds.

What does it mean to have a lot allocation with a parent Workstation that, in turn, doesn’t have a Standard Operation. Similarly “ditto” for the left diamond.

And finally, what does it mean to start at Lot Allocation and through the right diamond via the left leg and right leg to end up at different instances of standard operation?

Worriedly yours,
Mike Gorman

Alec Sharp
March 28, 2011

Hey Mike -

Good spotting! You noticed the “easter egg” (the two diamonds) that I deliberately introduced in that model, and which will be the subject off the next post. I said the next one would be on presenting models, but that was actually a ruse - the next one will be on transitivity in models, including how to address that “double diamond” structure you noticed. We’ll also get into the fact that the modeler CANNOT rely just on the ERD, and must express much of the model (arguably the most important aspects) in narrative form, as Assertions and Definitions. As our friend Graham Witt says, “a data model is a narrative supported by a graphic.”

Thanks again for the insightful comments.

Mike Gorman
March 28, 2011

Alec:

I may have said that the ISO 11179 model has a diamond in it thus, it is inherently ambiguous. What makes it even more intesting, the top and sides of the ISO 11179 data model are BOMs.

It was truly a challenge to program against to ensure no ambiguity.

What my son, Mike Jr. (with TWO Ph.D) suggested is that there are three choices:

1) Who gives a crap
2) Abssolutely the same PkeyKid value for both left and right intersection, or
3) Both must ultimately have a same ancestor in common.

Mike Jr’s point was that since data models are representations of abstract representations of alleged real things, there can be many different and valid abstract representations, and if they ultimately converge, that may just be good enuf.

So, I chose #3 and it was real fun programming through three Bills of Materials to accomplish the task.

I’m pretty sure I’ve written about this on the Data Modeler Architecture and Concept of Operations book that I have under the Free Metabase link.

Regards,

Mike G

David Jaques-Watson
July 4, 2011

Mike: It’s less of a problem if the relationships are non-identifying.

Also, it becomes more interesting when you are modelling for a warehouse, and therefore have to relax the rules…

thananjayan chinnaswamy
July 28, 2011

Good one Alec!
The modern production units designed with work station that are specific to production step (in some cases few production steps combined together -process set), in that case you could very well keep Workstation and production step association in between and then you have lot allocation with only two parents.

These kind of scenarios through at us interesting challenges at times makes our job more interesting and lovaable.

Larry
February 26, 2012

It would have been helpful if you indicated and linked to the articles you reference and shown their order. “In the last post” is hard to find.

Name:

Email:

Comment:

What is missing: North, South, East?

Notify me of follow-up comments?