Alec Sharp

Five Ways to Go Wrong with Generalization

By Alec Sharp on June 23, 2010

Last month, after a few false starts, I managed to produce a post on generalization, a data modeling technique that causes difficulty for both beginning and experienced data modelers. The problems they build into their models are different, but what’s similar is that they’re typically unaware of them. Let’s shine a spotlight on the most common errors made with generalization so they can be avoided in future.

Five ways to go wrong with generalization

Here’s the list of typical problems in generalization that we closed last month’s post with:

1 – Failure to generalize, a.k.a. literalism

2 – Generalizing too much

3 – Generalizing too soon

4 – Confusing subtypes with roles, states, or other multi-valued characteristics

5 – Applying subtyping to the wrong entity


1 – Failure to generalize, a.k.a. literalism

The first way you could go wrong with generalization, and the one most often seen in models developed by junior or non-modelers, is simply to not do it. This is literalism, which could also be called overspecialization or undergeneralization – not generalizing even though it’s clearly the appropriate thing to do.  Consider a snippet of the lyrics from that old spiritual, “Dry Bones” –

The foot bone connected to the leg bone,

The leg bone connected to the knee bone,

The knee bone connected to the thigh bone,

The thigh bone connected to the back bone, and so on…

For anatomical accuracy, we could substitute tarsal bone, fibula, tibia, patella, femur, and so on. That wouldn’t change the problem, though, which is that an inexperienced modeler will translate this literally (hence the term) into an undergeneralized model in which every type of bone is represented as a distinct entity.

 

This might seem far-fetched, but I’ve often seen exactly that situation in production data structures that recorded topics such as product composition or organizational structure. The worst cases invariably involve a fixed (i.e., rigid or invariant) hierarchy. In one memorable case, the database supporting a manufacturing control system included a seven-level hierarchy: system, subsystem, assembly, subassembly, component, device, and part. (I think that was it – I’m going from memory.) Each was recorded in its own entity, connected to the next lower level via a 1:M relationship. This worked well until a new type of product required a new level somewhere in the middle of the hierarchy. Naturally, this wasn’t realized until the new product was due to go into production, and the launch team discovered it couldn’t be manufactured because of the data structure limitation. A major reprogramming effort was required, delaying the launch of the product for close to a year, and resulting, I was told, in the loss of hundreds of millions of dollars in revenue. Some sort of generalization, as illustrated on the right, would have been preferable.
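The original illustration is not reproduced here, so what follows is only a minimal sketch of one possible generalization, not the structure from the figure. It assumes a single self-referencing entity (hypothetically named Item) plus an ItemType that records the level as data. All class, field, and sample names are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ItemType:
    """A level in the hierarchy, held as data rather than as its own entity."""
    name: str


@dataclass(eq=False)
class Item:
    """Any node in the product structure, whatever its level (hypothetical name)."""
    name: str
    item_type: ItemType
    parent: Optional["Item"] = field(default=None, repr=False)
    children: list["Item"] = field(default_factory=list)

    def add_child(self, name: str, item_type: ItemType) -> "Item":
        """Attach a new item one level below this one."""
        child = Item(name, item_type, parent=self)
        self.children.append(child)
        return child

    def path(self) -> list[str]:
        """Names from the root of the hierarchy down to this item."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))


system = Item("Line 1", ItemType("System"))
assembly = system.add_child("Conveyor", ItemType("Assembly"))
# "Module" is an invented new level: it is one more ItemType value,
# not a new entity wedged into the middle of a fixed chain.
module = assembly.add_child("Drive module", ItemType("Module"))
part = module.add_child("Bearing", ItemType("Part"))

print(part.path())  # ['Line 1', 'Conveyor', 'Drive module', 'Bearing']
```

The trade-off is that rules such as "a part may not contain a system" are no longer enforced by the structure itself and would have to be stated separately.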

Think about it for a while, and see if you are able to recall similar (although perhaps not as damaging) examples from your experience such as fixed organizational, geographic, or marketing hierarchies.

2 – Overgeneralization

At the other end of the spectrum is overgeneralization, or, as I often describe it, “showing off.” If literalism can be expected from junior modelers, then overgeneralization can be expected from some (not all!) experienced modelers. Especially if they’ve never actually had a model used in the design of a production database! More on that in a moment, but first we need an example.

The case that comes to mind occurred at the Driver Licensing section of a state motor vehicle department. Many details had to be recorded about a Driver License Holder, including their various names and addresses, physical characteristics such as height, weight, or eye colour, and citations that had been issued against them. (Yes, I realize that in the spirit of a post on generalization these might actually be facts about a “Person” but let’s leave that argument aside for now.)

In this case, a data modeling consultant from one of the major database vendors successfully proposed a highly generalized structure in which names, addresses, physical characteristics, citations, and so on were all handled in a single generalized “Characteristic” entity. Management was thrilled, because this new structure was going to insulate them from the constant changes that the state legislature generated. Hooray! All was good until their developers had to write code against this ultra-generalized structure, at which point the lights dimmed. It turned out that the “Characteristic” entity was simple, but alongside it was a complex structure for capturing attribute names, attribute values, data types, validation rules, and so on. This part hadn’t been shown to IT management, or anyone else, and the supposedly simple data structure resulted in program logic that was excruciatingly complex. Whoops!
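To make the hidden complexity concrete, here is a small sketch of what reading a single fact might look like against such a structure. It is not the consultant's actual design. The AttributeDef metadata, the get_value helper, and the sample attribute names are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class AttributeDef:
    """Metadata row describing one kind of fact the structure may hold."""
    name: str
    data_type: str          # "int" or "str" in this sketch
    required: bool = False


@dataclass
class Characteristic:
    """Generic row: every fact about a holder is a name plus a text value."""
    holder_id: int
    attribute: str
    value: str


CASTS = {"int": int, "str": str}


def get_value(rows, defs, holder_id, attribute):
    """Look up one fact; typing and validation are redone in code on each read."""
    definition = defs[attribute]  # KeyError if the attribute was never defined
    matches = [r for r in rows
               if r.holder_id == holder_id and r.attribute == attribute]
    if not matches:
        if definition.required:
            raise ValueError(f"missing required attribute {attribute!r}")
        return None
    return CASTS[definition.data_type](matches[-1].value)


defs = {
    "height_cm": AttributeDef("height_cm", "int", required=True),
    "eye_colour": AttributeDef("eye_colour", "str"),
}
rows = [
    Characteristic(1, "height_cm", "180"),
    Characteristic(1, "eye_colour", "brown"),
]

print(get_value(rows, defs, 1, "height_cm"))  # 180
```

With a dedicated Driver License Holder entity the same read would be a plain attribute reference, with the data type and the "required" rule declared once in the structure rather than re-implemented in program logic.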

A little further investigation revealed that the data modeling consultant behind the generalized structure had never actually taken a data model through to implementation – he had been an instructor for 25 years, but had never actually implemented anything! He had, however, left a trail of frustration behind him.

That’s a good story, but here’s another that will make the same point…

I once made another well-known data modeler quite upset when I highlighted the perils of overgeneralization during a conference presentation. As usual with this fellow, he went directly into react mode, bypassing the listening state, and was quite upset by the time he got to me after the presentation. He heatedly accused me of undermining his current consulting engagement, during which he had recommended a generalized data structure to handle new equipment types on a manufacturing shop floor. His client, who was sitting next to him during my presentation, disagreed. Of course, the modeler had missed the relevant point I’d made – generalized data structures are absolutely essential when dealing with unpredictable new items. Ironically, the example I used was shop floor equipment where the descriptive attributes for a new type of machine are unpredictable, and the only way to handle them (short of creating a new entity for each new type of machine – literalism) is with a generalized data structure. My point was simply that generalization shouldn’t be employed when there are well understood, frequent, and highly predictable needs that can be handled in their own entity types.
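One way to picture that balance is a sketch like the following, in which the predictable facts about a machine are ordinary named attributes and only the unpredictable, type-specific ones go into a generalized name/value structure. The Machine class, its fields, and the sample values are hypothetical and are not taken from either engagement.

```python
from dataclasses import dataclass, field


@dataclass
class Machine:
    # Well understood, predictable facts: ordinary named attributes.
    asset_tag: str
    machine_type: str
    location: str
    # Unpredictable, type-specific facts: a generalized name/value structure.
    extra: dict[str, str] = field(default_factory=dict)


lathe = Machine("M-001", "Lathe", "Bay 3", {"max_swing_mm": "400"})

# A new type of machine arrives with descriptive attributes nobody anticipated.
# No new entity is needed, and the predictable attributes stay simple.
printer = Machine("M-002", "3D Printer", "Bay 5",
                  {"nozzle_diameter_mm": "0.4", "build_volume_l": "27"})

print(printer.location)                     # Bay 5
print(printer.extra["nozzle_diameter_mm"])  # 0.4
```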

3 – Don’t generalize too soon

Another problem that seems to arise more frequently with experienced modelers is with generalizing too early in the modeling process, a phenomenon which is nicely illustrated during one of the exercises in my Advanced Data Modeling course. The exercise is based on a hotel catering department in which clients can make a booking for a Function. Associated with the Function are requests for one or more rooms, one or more food deliveries, and one or more pieces of equipment (such as a projector or a whiteboard.)

Invariably, the more experienced modelers in the class immediately generalize these three concepts into a single entity which they call something like “Service” while the less experienced modelers create separate Room, Menu Item, and Equipment Item entities. Keeping the entities separate, at least at first, is the better course of action for two reasons. First, premature generalization will negatively impact the participation of subject matter experts because a “Service” isn’t a familiar concept like Room or Equipment Item. Second, because they won’t be dealing with familiar ideas, the business experts are less likely to identify facts (attributes or relationships) that are specific to Room, Equipment Item, or Menu Item.

Even when I’m certain that entities are good candidates for generalization, my strategy is always to leave them separate until the similarities are so evident that someone from the business side points it out. I might “facilitate” this awareness (“manipulate” might be more accurate) by ensuring that the entities are drawn alongside one another, with similar relationships drawn symmetrically, and similar attributes listed in the same order. Just being helpful!

4 – Confusing subtypes with roles, states, or other multi-valued characteristics

It used to be a common suggestion that the various states (statuses) of an entity be recorded as subtypes, the reasoning being that different states involve different attributes, and that states were mutually exclusive. For instance, a Driver License could have states such as Applied, Active, Suspended, Expired, and so on. However, even though a Driver License can have only one state at a time, it will almost always have multiple states over time, and we will want a permanent record of each of them along with facts like the effective date, end date, reason, and so on. That being the case, the best way to model this is with a characteristic (dependent, multivalued) entity called Driver License State.
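A rough sketch of that characteristic entity might look like this. The attribute names follow the facts listed above (effective date, end date, reason). The change_state and current_state helpers, and the sample licence number and dates, are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class DriverLicenseState:
    """One row per state the licence has been in, kept permanently."""
    state: str
    effective_date: date
    end_date: Optional[date] = None
    reason: str = ""


@dataclass
class DriverLicense:
    number: str
    states: list[DriverLicenseState] = field(default_factory=list)

    def current_state(self) -> Optional[DriverLicenseState]:
        """The one state with no end date: only one state at a time."""
        for s in self.states:
            if s.end_date is None:
                return s
        return None

    def change_state(self, state: str, on: date, reason: str = "") -> None:
        """Close the current state, if any, and open a new one."""
        current = self.current_state()
        if current is not None:
            current.end_date = on
        self.states.append(DriverLicenseState(state, on, reason=reason))


licence = DriverLicense("DL-0001")
licence.change_state("Applied", date(2009, 1, 5))
licence.change_state("Active", date(2009, 2, 1))
licence.change_state("Suspended", date(2010, 3, 15), reason="Unpaid citation")

print(licence.current_state().state)  # Suspended
print(len(licence.states))            # 3
```

Had the states been modeled as subtypes, the licence would be exactly one of Applied, Active, or Suspended, and the earlier states would have nowhere to live.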

Similarly, I’ve seen roles mistakenly handled as subtypes, including by that data modeling consultant I mentioned earlier who had never implemented anything. In one of his models, a Person could take on the roles of Driver License Holder, Vehicle Owner, Employee, and so on, but these are definitely not mutually exclusive. Again, a characteristic entity for each of these highly predictable roles is the way to go. My good friend Sally Bean, a UK-based Enterprise Architecture consultant, has a nice way of differentiating the two – “Subtypes are about what I am; roles are about what I do.” That plays into our final example of problems with generalization.

5 – Don’t subtype the wrong entity

This surprisingly frequent and often subtle error happens when an entity is subtyped when the subtypes really belong to a related characteristic or associative entity. In fact, this is done in an ERwin tutorial I found online. As illustrated below, Employee is subtyped into Manager and Associate. This seems quite sensible until you realize that an Employee could be an Associate for some period of time, be promoted to Manager for another period of time, and then move back into an Associate position. So, it isn’t actually the Employee that has subtypes, because Manager and Associate aren’t inherent properties of an Employee. What should be subtyped is probably the Position to which the Employee can be assigned via an associative entity such as Position Assignment.
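As a hedged sketch of that alternative, the subtypes below hang off Position, and an associative Position Assignment records who held which position and when. The class names follow the entities named above. The start and end dates, the employee, and the position titles are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Employee:
    name: str


@dataclass
class Position:
    title: str


class ManagerPosition(Position):
    """Subtype of Position, not of Employee."""


class AssociatePosition(Position):
    """Subtype of Position, not of Employee."""


@dataclass
class PositionAssignment:
    """Associative entity: one Employee in one Position for a period of time."""
    employee: Employee
    position: Position
    start_date: date
    end_date: Optional[date] = None


pat = Employee("Pat")
history = [
    PositionAssignment(pat, AssociatePosition("Sales Associate"),
                       date(2005, 1, 1), date(2007, 6, 30)),
    PositionAssignment(pat, ManagerPosition("Sales Manager"),
                       date(2007, 7, 1), date(2009, 12, 31)),
    PositionAssignment(pat, AssociatePosition("Sales Associate"),
                       date(2010, 1, 1)),
]

# The same Employee moves from Associate to Manager and back again
# without ever changing what kind of Employee they are.
kinds = [type(a.position).__name__ for a in history]
print(kinds)  # ['AssociatePosition', 'ManagerPosition', 'AssociatePosition']
```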

A similar example has Employee with subtypes such as Regular, Contract, Student Intern, etc. Again, it isn’t the Employee that should be subtyped, but the characteristic entity Employment Term. Per Sally’s guideline, both of these examples confused “what I am” with “what I do” (or “what is done to me.”)

I’d love to hear from you about problems you’ve seen with generalization – maybe I’ll be able to wring another post from this topic!


About the Author

Alec Sharp has managed his consulting and education business, Clariteq Systems Consulting Ltd., for close to 30 years. Serving clients from Ireland to India, and Washington to Wellington, Alec has expertise in a rare combination of fields - data management, business analysis, business process improvement, and enterprise architecture.

Karen Lopez
November 16, 2010

I’m with you on all these. Recommending the proper level of generalization and the right time to generalize must come from a thorough understanding of the physical implementation trade-offs.

It’s nice when modelers want to propose a 27-level subtyping scheme that would make any taxonomist tear up at the beauty of it all, but trying to implement that in a production system that makes sense to the business is near impossible.  In fact, I’d say it violates the whole reason we are here to create models: to architect data in a way that meets the needs of the business.

I have “literalized” structures that might be changed in the future because the performance benefits were huge. The context of this model was that if intervening levels were to be introduced, it meant that the industry as a whole had adopted a change and that business processes would also be greatly impacted.  For me, that was a clear indication that gaining the performance benefits was worth the possible future risk of needing to make a change.

Mike Gorman
March 25, 2011

Alec:

It’s sad that I only noticed these posts in March 2011. As a generalization they’re quite good and very practical. I’m hoping that in reading on through time you address Bills-of-Materials. I’ve found BOMs quite useful for intersecting what is “possible to be done” versus “what is actually done.” Such would be the case in case/evidence management scenarios.

Again, great posts.

Mike Gorman

Alec Sharp
March 28, 2011

Thanks for another comment, Mike, and thanks for the generalization in my post on generalization. Nice piece of recursion!

I’ll think about that post on BOMs - like you, I use them all the time. Some of the later posts on recursion get into the subject a bit.

Cheers,
Alec

Mike Gorman
March 28, 2011

I have a BOM training app if you want to see it. The big killer is getting the user to understand that what they see unfold may not actually exist as real-honest-to-goodness records, and so cannot be adequately “printed out” with reporting tools like Crystal Reports.

Regards one more time,

Mike G
