Alec Sharp

Normalization for Mere Mortals

By Alec Sharp on December 30, 2010
View Full Bio →

My last two posts discussed the how and why of drawing an Entity-Relationship Diagram so dependency flows from top to bottom. (Be forewarned that if you haven’t read those posts, this one won’t make much sense.) As usual, it’s been a long and winding road. My original intent was to illustrate how to explain normalization to analysts, developers, and other mere mortals. That, however, requires following certain graphic principles, hence the detour (useful, I hope!) into diagramming conventions. Now, at last, we’ll get to a short discussion on normalization. Almost. 

Three benefits of drawing ERDs top-down
First, I’d like to wrap up the diagramming topic with just three of the numerous benefits I’ve seen from the top-down approach.

  1. This is the one I stressed from the outset – most people who participate in developing, reviewing, and using a data model (principally business analysts, subject matter experts, and developers) simply find it easier to understand and follow an ERD drawn top-down. This means the model has a greater chance of being correct, and a greater chance of being used correctly.
  2. To expand on a point I made in the last post, when I review a logical data model that I have redrawn in top-down format, I often find relationships that have been attached incorrectly –  “side to side” (or non-parent-child) relationships have been attached either too “high” or too “low” in the dependency hierarchy. If a relationship is attached too high, then granularity needed by the business isn’t captured. This arose in a model for a training organization in which a Student was related to a Course in which they were registered (“too high”) rather than to a specific Course Offering, which is a dependent of Course (“lower.”)  In another model, a relationship was too “low” – a Purchase Item was related to a particular Store Display Location (aisle, shelf, bin) at which a Product was displayed for sale in a retail Store. (Some Products were displayed in multiple locations.) The business would have loved this granularity for the analysis it would allow, but there was no way for the checkout process, whether cashier or self-serve, to know which location a particular item came from. This might be feasible now that RFIDs are relatively cheap, but at the time, it was an example of A Relationship Too Low.
  3. When transforming a top-down ERD to a dimensional model, the Facts are easy to spot because they will usually be based on an associative entity close to the bottom of the diagram. The Dimensions will be based on the entities encountered when following M:1 relationships “up” through parents, grandparents, and so on towards the Kernels at the top. If you find yourself doing a U-turn and going back “down,” in the 1:M direction, it’s time to stop and do some deep thinking about what you’ll do next. Probably back up. Hmmm – sounds like a future post.

Now can we discuss normalization? Soon. Very soon

The pop quiz
In the last post I showed the diagram below, but without the entity-type labels, and asked you to count how many of each of the four entity types were depicted based on nothing but the visual cues I had described. Here’s the answer – as the labels illustrate, there are:

  • 3 kernels
  • 4 characteristics
  • 1 associative
  • 1 reference

If you resolved the M:M relationship in the middle of the diagram between the characteristic and the kernel, creating the associative entity I’ve added with the dashed lines, you would have a total of 2 associatives.

Are we there yet? Yes!

Normalization made complex

As I said in the first post on this topic, I run Data Modeling and Advanced Data Modeling workshops all over the world. About half of the participants have suffered through normalization in school, but virtually none of them – even the data management professionals – can explain normalization in a way that is comprehensible to mere mortals. In fact, they usually can’t even explain it in a way that is incomprehensible. Without burning up a lot of space, let’s agree this is a problem – it might mean that they really don’t understand normalization, and either way, if it can’t be explained easily to non-modelers, it might make some of our modeling decisions seem less rational and more arbitrary.

I think the lack of clarity around normalization stems from how it is usually taught, which goes back to how it was originally defined, as a set of rules about relationships between key and non-key attributes. That makes things more complex than they have to be because it moves the discussion away from a relatively natural discussion of things (entities) and facts about things (attributes and relationships) into some relatively unnatural rules. I hear repeatedly from students that in their very first database or data modeling class at university they were exposed to such thrilling insights as “In Second Normal Form every non-key attribute must be fully functionally dependent on the entire multi-part primary key.” Yikes! This reminds me of why I swore, back in my university days, that I was never going to get mixed up with databases because – they were just too complicated! Mind you, I also swore I’d never work for IBM, but naturally my first job was doing database technical support for IBM. So much for student predictions.

Returning to our topic, I know it’s almost heresy to say so, but choosing primary keys is one of the very last things a modeler should be concerned with. (Hey – another blog post!) Diving right into teaching data modeling with a key-focused discussion of normalization is guaranteed to get things off on the wrong track – instead of modeling a business, the focus is designing a relational database. Let’s look at an alternative.

Normalization made simple
Down. Up. Sideways.

There, that’s the essence of 1st, 2nd, and 3rd normal forms as long as the data model is drawn top-down.

Not enough detail? Okay, we can flesh it out a bit, in layers.

Note that in explaining this, I always demonstrate it first, using easily understood examples like company-issued assets (laptops, phones, keys, …) or students registering in courses, and then introduce the “theory.” For the data folks who read this blog I’ll reverse that to ensure we’re using the same terminology:

  1. If any attribute in an entity could take on multiple, different values, it is multivalued.  If an entity contains any multivalued attributes, it is unnormalized. That is, it is in UNF, or 0NF if you prefer.
  2. When multivalued attributes are moved down in the model, into a new or already existing dependent characteristic or associative entity, the model has been put in First Normal Form (1NF.)
  3. Next we check for redundant attributes, that is, cases where the same attribute value would be recorded more than once, across multiple instances of the same entity. A redundant attribute is not a fact about the entity it’s been placed in, and Second Normal Form and Third Normal Form are all about finding a better home for it. If inspection (and common sense) reveals that a redundant attribute is actually a fact about an entity up the family tree (a parent, grandparent, great-grandparent, …) it is moved there, which might involve adding a new entity to the model. This puts the model in Second Normal Form (2NF.)
  4. The remaining redundant attributes that didn’t belong further up in the model must be facts about related entities off to the side, many of which will be reference entities. Moving those attributes sideways into new or existing related entities puts the model in Third Normal Form (3NF.)

Here’s the condensed version of the three stages of normalization  – multivalued attributes go down into a characteristic or associative entity (1NF,) some redundant attributes go up into the parent (or grandparent or…) entity they are really facts about (2NF,) and other redundant attributes that do not belong up the family tree go sideways into an entity that is not related via a dependent (parent-child) relationship (3NF.) Of course you remember that dependent (parent-child) relationships go up and down, and non-dependent relationships go side to side.

The important points to note here are:

  • It is possible to explain the concepts of normalization, and to produce a normalized model, with no reference whatsoever to keys of any sort (primary, foreign, candidate, or whatever.)
  • The explanation relies only on two, easily demonstrated ideas: there could be multiple values for the same attribute (multivalued attribute,) and the same attribute value could be recorded multiple times (redundant attribute.)

Using the examples from the last post:

First Normal Form

  • If a Worker had multiple values for the attributes Performance Review Date and Performance Review Score, we’d move those down into the Performance Review characteristic entity, an example of 1NF.

  • If a Worker had multiple values for the attribute Benefit Program Enrollment Date, we’d move those down into the Enrollment associative entity, another example of 1NF.

Second Normal Form

  • If an Organization Unit entity contained the attributes Company Name and Company Incorporation Date, they would clearly be redundant – the same values would be recorded for each Organization Unit within a Company. If we moved the attribute up into a new Company entity, that would be an example of 2NF.

Third Normal Form

  • If the Worker entity contained the attribute Employment Type (regular, contract, student intern, etc.) it would be redundant. For instance, the value “contract” would be recorded redundantly for each contract Worker. If we moved the attribute into a new Employment Type entity, that would be an example of 3NF, and is also an example of an attribute that has a constrained set of values.

Your assignment
The preceding examples could have been more detailed, but that’s intentional – before the next post, I’d like you to build an example relevant to your organization that demonstrates the three normal forms. If you want to send it to me for review, all the better – I promise a response. I’ll also provide a more complete example next time.

Next post
Of course, we data folks couldn’t leave well enough alone, and no sooner had Ted Codd introduced the first three normal forms than people started coming up with more – 4NF, 5NF, Boyce-Codd NF, and so on. These can be especially baffling, but, again, there are simple explanations – that’s what we’ll look at next. I’ll also go out on a limb and explain why I’d reduce the first five normal forms to three. More heresy! Come back for more.

Follow all Expert Blog updates by subscribing to the RSS RSS feed.

About the Author

Alec Sharp has managed his consulting and education business, Clariteq Systems Consulting Ltd., for close to 30 years. Serving clients from Ireland to India, and Washington to Wellington, Alec has expertise in a rare combination of fields - data management, business analysis, business process improvement, and enterprise architecture.

There have been no comments yet.

Name:

Email:

Comment:

The color of grass is usually...?

Notify me of follow-up comments?