Does Data Modeling Have a Cloudy Future?
By Malcolm Chisholm on March 29, 2010
Cloud computing seems to be on everybody's mind these days. And there seem to be a lot of perspectives on just what it is. For some it is cheap infrastructure, for others a way to distribute data cheaply, and for a few it is about a new class of databases. There are other perspectives too, but the one that interests me is the new class of databases - and what the implications are for data modeling.
The new class of databases is called "columnar databases". The reason that they are associated with the Cloud is that they are intended to manage data at ultra-large scale. Right now that means petabyte (10^15 bytes) and exabyte (10^18 bytes) volumes of data. Only the Cloud hardware and operating system environment can provide capacity to house such volumes at an acceptable cost. Hence the link between columnar databases and the Cloud. There is nothing to prevent columnar databases being implemented for smaller data volumes, but they really shine at ultra-large scale.
What Is a Columnar Database?
So what is a columnar database? The problem with answering this question is that you have to unlearn the relational paradigm to be able to appreciate how these things work. For thirty years, the relational paradigm has dominated data management, and for many data modelers it is all they know.
That said, a columnar database consists of tables whose rows are identified by a Row Id. It is not the record that has a common structure, as in relational, but just the Row Id, which may be one field or a few concatenated fields. A row can have an (almost) arbitrary number of columns. That is, within a given table, one row can be associated with one column, and another with perhaps thousands of columns. In theory, it is possible that no two rows have exactly the same mix of columns.
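To make the row structure concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the idea described above, not how any particular product stores data; the class name SparseTable, its methods, and the sample column names are all hypothetical.

```python
class SparseTable:
    """Toy table: rows share a Row Id structure, not a record layout."""

    def __init__(self):
        # Row Id -> {column name -> value}; every row keeps its own column set.
        self._rows = {}

    def put(self, row_id, column, value):
        # Create the row on first write, then set just this one column.
        self._rows.setdefault(row_id, {})[column] = value

    def columns_of(self, row_id):
        # Column names present on this row only (empty if the row is unknown).
        return sorted(self._rows.get(row_id, {}))


table = SparseTable()

# The Row Id here is two concatenated fields: (region, customer number).
table.put(("us", "c1"), "last_name", "Smith")
table.put(("us", "c1"), "email", "smith@example.com")

# A second row in the same table with a different mix of columns.
table.put(("uk", "c2"), "last_name", "Jones")
table.put(("uk", "c2"), "loyalty_tier", "gold")
table.put(("uk", "c2"), "last_login", "2010-03-29")

us_columns = table.columns_of(("us", "c1"))
uk_columns = table.columns_of(("uk", "c2"))
```

The two rows live in the same table yet have only one column name in common, which is exactly the situation a relational table definition would not allow.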
Now you know why you have to unlearn the relational paradigm to understand this stuff. It gets even stranger, but we will concentrate on the basics for now.
When you read a columnar database you do not get back a record in the sense of a whole set of columns. You get back only one column. Imagine a relational table, but in a situation where a SQL SELECT could only give you the surrogate key plus one extra column. In other words, you could only read one selected column "stripe" at a time.
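The "stripe" read described above can be sketched in a few lines of Python. This is a toy illustration under the assumption that the table is a plain dictionary keyed by Row Id; the function name read_stripe and the sample data are hypothetical.

```python
def read_stripe(rows, column):
    """Return (row_id, value) pairs for one column, skipping rows that lack it."""
    return [
        (row_id, columns[column])
        for row_id, columns in sorted(rows.items())
        if column in columns
    ]


# Row Id -> columns present on that row; no two rows have the same mix.
rows = {
    "r1": {"last_name": "Smith", "email": "smith@example.com"},
    "r2": {"last_name": "Jones", "loyalty_tier": "gold"},
    "r3": {"email": "lee@example.com"},
}

# One read returns the Row Id plus a single column, never a whole record.
last_names = read_stripe(rows, "last_name")
```

Row r3 has no last_name column, so it simply does not appear in that stripe. Getting a second column back means a second read.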
At this point you may be wondering why anyone would use these strange beasts. The answer is that they can answer queries with amazing performance at ultra-large scale. They lie behind Google, LinkedIn, Facebook, and others. The success of these companies is rather strong proof of the power of columnar databases. I think it means they will be with us for some time. The challenge for data modeling will be how to represent them, and it will be interesting to see how that will be done.
About the Author
Malcolm Chisholm, Ph.D. has over 25 years of experience in enterprise information management and data management and has worked in a wide range of sectors. He specializes in setting up and developing enterprise information management units, master data management, and business rules.
Thanks for the great article, Malcolm.
I’m wondering what the best way is to handle metadata with columnar databases. As you mention, it breaks our traditional relational paradigm, in which a single column can be rationalized and defined. Does a single definition of “customer last name”, for example, still have a place in the design of these systems? How is it documented and stored?
April 1, 2010
Donna - I think the major problem is that data is almost necessarily stored redundantly many times in the columnar databases, due to the need to reformat it to answer specific queries, and to create inverted indexes. The challenge is not only to have a definition of a field, but also to know where this field is stored. Right now there is very little support either to model this or to maintain the metadata in a repository - Cloud databases do not have system catalogs as we are used to them.
March 29, 2010