Alec Sharp

Generalization In General

By Alec Sharp on May 23, 2010

Last month, I set out to write about those data modeling soulmates, generalization and recursion. Somehow, the topic ended up being entity naming, which, to my surprise, generated some nice feedback over Twitter. This month, I restarted on generalization and recursion, but again, that’s not quite how things worked out. By the time I’d recapped the basics of generalization, I had more than enough content for a post, with a lot of material on the common pitfalls in generalization left over.  That’s going to be next month’s topic, and maybe after that, I’ll finally get to some of the cool things you can do when generalization is handled properly, and combined with interesting forms of recursion involving supertypes and subtypes.

Generalization in general

This post looks at the basics of generalization, including the concepts of supertypes and subtypes. If you’re reading a data modeling blog, you’re likely to be familiar with this already, but please read on – it might highlight some of the interesting differences of opinion among data modelers on how to best apply these concepts.

Obligatory note on terminology

As explained in my last post, some of you use the terms entity type and entity. In this convention, Bob, Carol, Ted, and Alice are entities that are instances of the Person entity type. I’ll follow the more common convention of referring to an entity type simply as an entity, and to entity instances simply as instances or occurrences. Bob, Carol, et al. are instances of the Person entity. Finally, I’ll simply use supertype rather than supertype entity, and subtype instead of subtype entity.

An overview, using the data modeling world’s most common example

You’ve surely seen the “Party” data structure, which is the most common example of this month’s topic:

Business Party (or simply “Party,” or alternatively “Legal Entity”) is a supertype, which has two subtypes, Person and Organization. You could also say that the specific entities Person and Organization have been generalized into Business Party. Attributes that are common to both Persons and Organizations go into the supertype, while attributes that are unique to one of them (birth date for Person, or organization type for Organization) go in the appropriate subtype. The same is true for relationships.
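For readers who prefer code to diagrams, the Party structure can be sketched roughly with Python dataclasses. This is only an illustration: `party_id`, `name`, and the sample values are assumptions, and only birth date and organization type come from the description above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class BusinessParty:
    """Supertype: holds the facts common to every party."""
    party_id: int   # hypothetical identifier
    name: str       # hypothetical common attribute


@dataclass
class Person(BusinessParty):
    """Subtype: adds a fact that applies only to people."""
    birth_date: date


@dataclass
class Organization(BusinessParty):
    """Subtype: adds a fact that applies only to organizations."""
    organization_type: str


# Sample instances (made-up values).
parties = [
    Person(party_id=1, name="Alice", birth_date=date(1970, 1, 1)),
    Organization(party_id=2, name="Acme Ltd.", organization_type="corporation"),
]

# Code that needs only the common facts can treat every instance as a BusinessParty.
party_names = [p.name for p in parties]
```

Anything that works with the supertype's attributes works for both subtypes, while `birth_date` exists only on a Person and `organization_type` only on an Organization.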

By the way, please don’t get hung up on the diagramming convention – there are lots to choose from! Your preference, possibly driven by the tool you use, might be the hierarchical version shown on the left. To indicate “subtypes” you might use a semi-circle with an “X” in it, a solid triangle, a circle with one or two lines under it, or something else. And on the relationship lines you might include a single hash mark to indicate cardinality of 1:1, double hash marks to indicate mandatory 1:1, or no hash marks at all. Or, you might avoid all those decisions by following the nicely intuitive “box in box” convention on the right. Both approaches have strengths and weaknesses. And no, I won’t discuss them – all these years in the business have taught me that discussions of notation produce much heat but little illumination.

Generalization 101

When we add an entity to a data model, it’s because there is a set of things that are inherently similar in terms of what they are, what we need to know about them, and possibly how they behave, if you take an O-O slant. An entity is like a template that all instances follow in terms of name, definition, facts (attributes and relationships), and so on. Examples in a human resources data model include Worker, Organization Unit, Position, and Benefit Program. All Workers have a Worker ID, a Legal Name, a Birth Date, and so on, and can participate in specific relationships (“Worker is assigned to Organization Unit”), so we create the Worker entity as a template for describing all instances. You could even argue (I won’t) that creating a new entity is an example of generalization – we’ve generalized by saying that all instances of the proposed entity conform to the same template.

We specifically add the data modeling technique of generalization to the mix in either of two situations. In the first, different entities seem to conform to the same template. In the second, different entities partially conform to the same template, but have some unique properties as well. Let’s take a closer look at each:

1 – Similarities leading to generalization

In this situation, we realize that two or more different entities are so inherently similar that they are better (more simply and flexibly) handled as a single entity. By “inherently similar,” I mean that they are the same kind of thing, and have the same facts (attributes and relationships). For instance, because they’re the same basic kind of thing, and share facts such as name, primary location, contact points, budget centre code, manager’s position id, and so on, we’ll be better off with a single generalized Organization Unit entity than with separate entities for Division, Department, Section, Team, and so on.

As shown in the example above, Organization Unit will be recursively related to itself to record the organization’s structure, and will probably be classified by the reference entity Organization Unit Type which will indicate if it is a “division,” a “department,” or whatever. There will be less programming overhead to maintain and report data about that single entity, although you might get some flak from developers who don’t want to deal with that recursive relationship. Ignore them – it isn’t that hard, and besides, it’s their job. :-)

There shouldn’t be any argument, though, that the generalized solution offers greater flexibility – when a new type of Organization Unit comes along, which it inevitably will, handling it will be as simple as adding a new entry to Organization Unit Type and using existing application facilities to add the necessary Org Units. The alternative is to get your DBA to create a new table (and everything that entails), and then create all the program logic needed to maintain and report it.
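To make the generalized Organization Unit concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table names, column names, sample units, and the “working_group” type are all assumptions made for illustration, not a real design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE org_unit_type (
    type_code TEXT PRIMARY KEY            -- 'division', 'department', ...
);
CREATE TABLE org_unit (
    org_unit_id    INTEGER PRIMARY KEY,
    name           TEXT NOT NULL,
    type_code      TEXT NOT NULL REFERENCES org_unit_type(type_code),
    parent_unit_id INTEGER REFERENCES org_unit(org_unit_id)  -- recursive relationship
);
""")

conn.executemany("INSERT INTO org_unit_type VALUES (?)",
                 [("division",), ("department",), ("team",)])
conn.executemany("INSERT INTO org_unit VALUES (?, ?, ?, ?)", [
    (1, "Operations", "division", None),
    (2, "Logistics", "department", 1),
    (3, "Dispatch", "team", 2),
])

# A new kind of unit is just a new row in the reference table:
# no new table and no new program logic.
conn.execute("INSERT INTO org_unit_type VALUES ('working_group')")
conn.execute("INSERT INTO org_unit VALUES (4, 'Safety Review', 'working_group', 1)")


def units_under(root_id):
    """Return the names of a unit and everything beneath it, using a recursive query."""
    rows = conn.execute("""
        WITH RECURSIVE subtree(org_unit_id, name) AS (
            SELECT org_unit_id, name FROM org_unit WHERE org_unit_id = ?
            UNION ALL
            SELECT child.org_unit_id, child.name
            FROM org_unit AS child
            JOIN subtree ON child.parent_unit_id = subtree.org_unit_id
        )
        SELECT name FROM subtree ORDER BY org_unit_id
    """, (root_id,)).fetchall()
    return [name for (name,) in rows]
```

Calling `units_under(1)` walks the whole structure from the top unit down, whatever types of unit it contains. That one query is most of what the recursive relationship costs a developer.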

Here’s a quick example. Many years ago, HR consultants brought in by one of my clients decided the client needed two new types of organization units – “Flower” and “Petal.” (Perhaps “HR” stood for “hippie reborn.”) In any case, I had designed this client’s HR databases with the appropriate generalization, so it was a simple matter to add these new types of unit with no developer or DBA involvement. And to remove them six months later, when the client decided that having flowers and petals on the corporate org chart was a really dumb idea.

2 – Similarities and differences leading to a supertype (generalization) and subtypes (specialization)

In this case, we notice that within the scope of an entity there are not just similarities, but also important differences, both of which must be reflected in the model. For instance, all Rail Cars have a manufacture date, an in-service date, an ID, a type, a model designation, and other common attributes, and can participate in the “Rail Car is coupled to Rail Car” relationship. However, only a passenger car has the attributes galley flag and seat count, only a freight car has attributes to record its capacity and a relationship describing the commodity types it is allowed to carry, and only a locomotive has a horsepower rating, driven wheels count, and fuel type. We create a supertype entity called Rail Car to record the common attributes and relationships, and subtype entities Passenger Rail Car, Freight Rail Car, and Locomotive Rail Car to record attributes and relationships that are specific to the subtype.
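One common way to carry this structure into tables is a supertype table plus one table per subtype, all sharing the same key. The sketch below uses Python’s sqlite3 module. The column types, the capacity column’s form, and the sample rows are assumptions, and the Freight car’s commodity-type relationship is left out for brevity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE rail_car (                   -- supertype: common facts
    rail_car_id      INTEGER PRIMARY KEY,
    manufacture_date TEXT NOT NULL,
    in_service_date  TEXT NOT NULL,
    model            TEXT NOT NULL
);
CREATE TABLE passenger_rail_car (         -- subtype: shares the supertype's key
    rail_car_id INTEGER PRIMARY KEY REFERENCES rail_car(rail_car_id),
    galley_flag INTEGER NOT NULL,
    seat_count  INTEGER NOT NULL
);
CREATE TABLE freight_rail_car (
    rail_car_id INTEGER PRIMARY KEY REFERENCES rail_car(rail_car_id),
    capacity    INTEGER NOT NULL          -- unit of measure is an assumption
);
CREATE TABLE locomotive_rail_car (
    rail_car_id         INTEGER PRIMARY KEY REFERENCES rail_car(rail_car_id),
    horsepower_rating   INTEGER NOT NULL,
    driven_wheels_count INTEGER NOT NULL,
    fuel_type           TEXT NOT NULL
);
""")

# Made-up sample data: every car gets a supertype row and one subtype row.
conn.execute("INSERT INTO rail_car VALUES (1, '1990-01-01', '1990-06-01', 'P-100')")
conn.execute("INSERT INTO passenger_rail_car VALUES (1, 1, 48)")
conn.execute("INSERT INTO rail_car VALUES (2, '1995-01-01', '1995-06-01', 'L-200')")
conn.execute("INSERT INTO locomotive_rail_car VALUES (2, 3000, 6, 'diesel')")


def passenger_cars():
    """Join common and subtype-specific facts for passenger cars."""
    return conn.execute("""
        SELECT rc.rail_car_id, rc.model, p.seat_count
        FROM rail_car AS rc
        JOIN passenger_rail_car AS p ON p.rail_car_id = rc.rail_car_id
    """).fetchall()
```

Queries that need only the common facts read `rail_car` alone. Queries that need subtype facts join to the relevant subtype table, as `passenger_cars()` does.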

Here’s an example using slightly different notation. As with the previous example, don’t get hung up on details like attribute naming, handling multi-valued attributes, showing foreign keys, and so on – it’s the concept we’re worried about here.

By the way, I recognize that the four-engine, self-powered Budd passenger cars I was a steward on as a young man had some properties of both Passenger and Locomotive cars, but if we get into multiple inheritance I’ll never finish the column, and that’s happened once already.

Summarizing the two cases

We can summarize by saying that sometimes (case 1, the Organization Unit) you generalize into a supertype with no unique attributes left behind in subtypes, and other times (case 2, the Rail Car) you generalize into a supertype while leaving unique facts behind in two or more subtypes. Your preferred terminology might be generalization-specialization, or simply gen-spec.

Constraining the use of subtypes

You probably agreed with most of the preceding, but experience tells me the next point is where confusion among new modelers and disagreement among experienced modelers often arise. Here goes… with rare exceptions, I’ve found supertype-subtype structures to be useful only when two criteria are met:

  1. The subtypes are mutually exclusive. That is, an instance of Rail Car can only be one of the subtypes (again, we’ll set aside multiple inheritance.)
  2. The subtype is mandatory. That is, a Rail Car must be either a Passenger, Freight, or Locomotive car – there is no instance that is just “a rail car.” This is sometimes described by saying Rail Car is an abstract concept, because it is a generalization that doesn’t actually exist, while the subtypes like a Passenger Rail Car are concrete concepts, because they actually exist.
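The two criteria above map fairly directly onto an abstract base class, at least as a rough analogy. In this sketch the supertype cannot be instantiated on its own (the subtype is mandatory), and each instance is created as exactly one concrete subtype (the subtypes are mutually exclusive). The method `subtype_name` and the sample values are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class RailCar(ABC):
    """Abstract supertype: common facts only, never instantiated directly."""
    rail_car_id: int
    model: str

    @abstractmethod
    def subtype_name(self) -> str:
        """Each concrete subtype must say what kind of car it is."""


@dataclass
class PassengerRailCar(RailCar):
    seat_count: int

    def subtype_name(self) -> str:
        return "passenger"


@dataclass
class LocomotiveRailCar(RailCar):
    horsepower_rating: int

    def subtype_name(self) -> str:
        return "locomotive"


def can_create_plain_rail_car() -> bool:
    """Return True if a Rail Car with no subtype could be created (it should not be)."""
    try:
        RailCar(rail_car_id=99, model="X-1")
    except TypeError:
        # Python refuses to instantiate a class that has abstract methods.
        return False
    return True


car = PassengerRailCar(rail_car_id=1, model="P-100", seat_count=48)
```

Here `can_create_plain_rail_car()` returns `False`, and `car` is a Passenger car and nothing else. A database needs its own mechanism to enforce the same two rules, but the intent is the same.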

When optional subtypes and overlapping (non-mutually exclusive) subtypes are added to the mix, fuzzy thinking comes along for the ride, and an inaccurate, restrictive model is the frequent result. That’s because if some properties of an entity are optional and overlapping, they’re probably multi-valued as well, so subtyping is not the appropriate modeling construct to use. Rather, these should be handled just like any other optional, overlapping, multi-valued property – as a dependent characteristic (or attributive) entity.
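As a hedged illustration of the alternative, suppose a Person may hold zero, one, or several roles at once. That scenario is an assumption chosen for this sketch, as are all table, column, and role names. Instead of a subtype per role, the roles go in a dependent entity keyed by the parent plus the role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
-- Dependent (characteristic) entity: zero, one, or many rows per person.
CREATE TABLE person_role (
    person_id INTEGER NOT NULL REFERENCES person(person_id),
    role_name TEXT NOT NULL,
    PRIMARY KEY (person_id, role_name)
);
""")

conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Bob"), (2, "Carol"), (3, "Ted")])
conn.executemany("INSERT INTO person_role VALUES (?, ?)",
                 [(1, "employee"), (1, "customer"), (2, "customer")])


def roles_of(person_id):
    """Return the roles recorded for one person, in alphabetical order."""
    rows = conn.execute(
        "SELECT role_name FROM person_role WHERE person_id = ? ORDER BY role_name",
        (person_id,),
    ).fetchall()
    return [role for (role,) in rows]
```

In this sample data Bob holds two roles at once (overlapping) and Ted holds none (optional), and neither case bends the model. A mutually exclusive, mandatory subtype structure cannot represent either one.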

Looking ahead

That last point is one of the examples of incorrect generalization, supertyping, and subtyping we’ll look at next month. Here’s the full list as it stands right now:

1 – Failure to generalize, a.k.a. literalism

2 – Generalizing too much

3 – Generalizing too soon

4 – Confusing subtypes with roles, states, or other multi-valued characteristics

5 – Applying subtyping to the wrong entity

With your suggestions, perhaps it can grow into The Seven Deadly Sins of Generalization, or even a Top Ten list. I’ll look forward to hearing any ideas you have, or any war stories about problems caused by incorrect handling of this important subject.


About the Author

Alec Sharp has managed his consulting and education business, Clariteq Systems Consulting Ltd., for close to 30 years. Serving clients from Ireland to India, and Washington to Wellington, Alec has expertise in a rare combination of fields - data management, business analysis, business process improvement, and enterprise architecture.

Neil Raden
May 23, 2010

I think that data modeling is a futile attempt to make sense out of nonsense. But what else can we do? Alec certainly does it elegantly.

Alec Sharp
May 25, 2010

Hi Neil -
Great to see you here. Now that I’ve parsed your comment I think I know what to say. :-) Describing data modeling as “making sense out of nonsense” is a pretty good way to describe it, although I might have chosen something more like “bringing order - or at least, *some* order - to a chaotic domain.” I’ve had pretty good luck with that. I think the elegance (thank you!) comes in recognizing that the quest for perfection and the single canonical view is futile, and that it’s better to do something that’s good enough to make the situation better than it was before you started. That’s why my unofficial corporate motto is “GEFN” - “good enough for now.” “Close enough for government work” is another good variation, or, as a woman I used to work with said, “Good enough for the guys we date.”
Cheers,
Alec
