Scala For Quants: It's a Row, It's a Plane, It's a SuperColumn! The excellent Cassandra data model and its not so excellent terminology

Someone once said that there are no difficult concepts, just bad explanations. This appears to apply to the Cassandra data model and for this reason there are plenty of explanations out there including "WTF is a supercolumn" by Arin Sarkissian. They have all added to my confusion. Some of them even give the impression that a supercolumn is some kind of super-concept and that doesn't appear to be the case. Indeed we should probably modify the adage. There are no difficult concepts, just really horrible naming conventions.

Allegedly Conceptual Prerequisites

1. A cell: (name, value, timestamp)

In my new naming convention a cell is a three-tuple where the value is something we care about, the name is thought of as an index (row or column if you like) and the timestamp is something Cassandra cares about more than you typically would (synchronicity issues, et cetera). Now a three-tuple comprising name, value and timestamp is a straightforward thing unless of course you call it a column, which is the official terminology. Why would you call a single instance of a three-tuple a column instead of a cell? I don't really know. Please forget I said "column", and try not to think about a column of soldiers.

2. A list of cells {(name, value, timestamp), (name, value, timestamp)}

Self explanatory, unless you call it a column family. Since timestamps fade into the background you can think of it as a map from name to value.

3. A named list of cells (name, <list>)

Self explanatory, unless you call it a super-column.
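For the code-minded, here is a minimal sketch of the three prerequisites in Python. The class and function names (Cell, NamedCells, as_map) are my own inventions, not Cassandra's, and the rule that the newest timestamp wins when a name repeats is an assumption I make so that the list collapses cleanly into a map.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Cell:
    """1. A cell: (name, value, timestamp). Officially, a "column"."""
    name: str
    value: Any
    timestamp: int


@dataclass(frozen=True)
class NamedCells:
    """3. A named list of cells: (name, <list>)."""
    name: str
    cells: List[Cell]


def as_map(cells: List[Cell]) -> Dict[str, Any]:
    """2. A list of cells viewed as a map from name to value.

    Assumption: if a name repeats, the cell with the newest timestamp wins.
    """
    newest: Dict[str, Cell] = {}
    for cell in cells:
        seen = newest.get(cell.name)
        if seen is None or cell.timestamp > seen.timestamp:
            newest[cell.name] = cell
    return {name: cell.value for name, cell in newest.items()}


# Hypothetical data, purely for illustration.
cells = [Cell("price", 101.5, 1), Cell("size", 200, 1), Cell("price", 101.7, 2)]
named = NamedCells("AAPL", cells)
view = as_map(named.cells)
```

Once the timestamps have done their job, view is just an ordinary name-to-value map.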

Official Cassandra Terminology (Warning: Very Dumb)

Well that was easy. Unfortunately you might need to communicate with Cassandra, Cassandra documentation, or Cassandra fans so we are not even half way there. So I'll do my best to describe the essential terminology even though, if the glossary is anything to go by, it isn't entirely clear this has been agreed upon.

Definition: Column

Naively reading the documentation you would be forgiven for thinking that a column is a lone, solitary three-tuple (name, value, timestamp) because that is what the documentation says and, even more remarkably, that is what a column is. A column is a cell. This terminology might be logical, I suppose, in the sense that it specifies a column, but actually that is not the intent, nor does a single cell comprise a column in the usual sense. Specification versus content might be my confusion and the documentation flicks effortlessly between data representation and data content as we continue to column families, and so forth. But I think we end up getting the joke. A column is a solitary three-tuple just like a solitary soldier comprising a column.

Definition: Column family {(name, value, timestamp), (name, value, timestamp), ... }

A column family is what I referred to above as a list of cells, thus defining a list of what might be interpreted as column names (following the previous lack-of-definition to the word). But NO, that isn't the intent ... read on.

Definition: SuperColumn {(name, <list (i.e. map)>)}

A super column is like a column (i.e., a lone tuple, let's get that straight) but its value is actually a list of cells. Oh, and it doesn't have a timestamp, so describing it as "like" a column seems a bit daft. There is no recursion to speak of as it turns out, despite this taunt.

It seems that a SuperColumn could be used to represent something like a column in a relational database table (a named vector of values), especially if you don't care for the names in the map (they might be left blank). On the other hand, it is more like a row in a database in most uses of maps of maps (see below) making the terminology even dumber. Confused? Read on.
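A rough sketch of that definition, for what it is worth. The type names Column and SuperColumn follow the official terminology, but the helper super_column and the sample data are hypothetical, and this is only my reading of the definition rather than anything Cassandra ships.

```python
from typing import Any, Dict, NamedTuple


class Column(NamedTuple):
    """A lone (name, value, timestamp) tuple, i.e. a cell."""
    name: str
    value: Any
    timestamp: int


class SuperColumn(NamedTuple):
    """A name plus a map of columns. Note: no timestamp of its own."""
    name: str
    columns: Dict[str, Column]


def super_column(name: str, *columns: Column) -> SuperColumn:
    """Hypothetical helper: index the given columns by their names."""
    return SuperColumn(name, {c.name: c for c in columns})


# Hypothetical data: one super column holding two ordinary columns.
address = super_column(
    "address",
    Column("city", "Sydney", 1),
    Column("postcode", "2000", 1),
)

# No recursion: the values inside a super column are plain columns.
all_plain = all(isinstance(c, Column) for c in address.columns.values())
```

The nesting stops after one level, which is why "super" turns out to mean so much less than it sounds.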

Definition: Column family

Next we learn that columns are "organized" into a column family comprising an ordered list of columns. If you are thinking that they might therefore represent how a table might be defined, don't. Perhaps though, if we were to represent something analogous to the contents of a table in a database we would need a super column family, right? Hmm, not really. But let's proceed.

Definition: SuperColumn family

Now column families, we are assured, can be either standard or "super" - just like petrol in Australia. So what is a super column family and is it analogous to the contents of a database table (in some special cases)? We may never know because a super column family is conspicuously absent from the official glossary and rumor is it is deprecated. Turning to Wikipedia we confirm our suspicion that a super column family is a ... wait for it ... a tuple that consists of a key-value pair where the key is mapped to a value that are column families. That seems to be struggling towards grammatical and logical sense, but I still think that Cassandra is all about inferring the design, not just reading about it in the documentation - which would be boring.

De-mystifying Cassandra Terminology?

Now maybe that is a bit unkind and this description of column families (collections of cells) is a little more illuminating. It is clear what the design is, though not whether anyone can describe it carefully using official terminology to the letter (because that confuses the data structure with the data contents, columns with rows, et cetera). No wonder this is considered an advanced topic even for veteran database administrators. Or perhaps it is an advanced topic only for veteran database administrators and only if you use terminology attempting sloppy adherence to database terminology. Or maybe it is a really advanced topic if you deliberately use database terminology but confuse columns with cells?

Example: A static column family

Remember what a column family is? I've forgotten due to the apparent ambiguity between singular and plural infecting everyone who has written about Cassandra. But the use case intent is seemingly self-evident. Here for example the first "row" (whatever that is) has the same names as the second, although some are missing in the third. The layout is clear, unless you try to map it back to the definitions of column family and super column family that make no f@#ing sense whatsoever. And where exactly did the term "row" get defined, you ask? Well a row is a sorted map that matches column names to column names, according to the ever helpful glossary. It has a name (a row key) and a list of columns (i.e. cells) so it would seem to be ... yes ... a SuperColumn!

Now hang on a second. Let's read that more carefully.

In a Column Family, a Row is a sorted map that matches column names to column values.

I'm inferring now that a single column family comprises all the green stuff in this picture (and not just one column, or row, say - using the usual meaning of "row" and "column") so that a row can be "in" a Column Family. Is this the correct usage of the terminology? I really don't know, and I'm pretty sure the following won't help us:

In a Super Column, a Row is a sorted map that matches Super Column names to maps matching Column names to Column values.

Just to refresh your memory, a super column is a named list of cells whereas a column family is an unnamed list of cells. Does that help?
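Stripped of the capital letters, the two glossary sentences seem to describe one level of nesting versus two. A rough sketch, assuming plain Python dicts built in key order can stand in for sorted maps and that timestamps can be ignored; the row key "AAPL" and all column names are made up for illustration.

```python
def sorted_map(mapping):
    """Return a dict whose keys iterate in sorted order (stand-in for a sorted map)."""
    return dict(sorted(mapping.items(), key=lambda kv: kv[0]))


# "In a Column Family, a Row is a sorted map that matches
#  column names to column values."
standard_row = sorted_map({"size": 200, "price": 101.5})

# The super variant: super column names -> (column names -> column values).
super_row = sorted_map({
    "trade": sorted_map({"size": 200, "price": 101.5}),
    "quote": sorted_map({"bid": 101.4, "ask": 101.6}),
})

# Under this reading, each family maps a row key to its row.
column_family = {"AAPL": standard_row}
super_column_family = {"AAPL": super_row}
```

On this reading the only difference between the two kinds of row is one extra layer of map.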

Example: A dynamic column family

Terminology aside we have the trivial generalization of the data structure to the case where the names need not coincide at all. Again, the intent is simple but what am I looking at? The adjective "dynamic" is applied to the collection, presumably, which implies that there is only one column family in the picture, but the "collection" we see is definitely not a single column family because a column family comprises a single list.
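If I take the reading that the whole picture is one column family (a map from row key to a map of names to values), then static versus dynamic is only a statement about how much the names overlap between rows. A minimal sketch under that assumption; the helper shared_names and every row key and name below are hypothetical.

```python
from collections import Counter


def shared_names(column_family):
    """Names that appear in more than one row of the column family."""
    counts = Counter(name for row in column_family.values() for name in row)
    return {name for name, n in counts.items() if n > 1}


# "Static": rows draw on the same names, though some may be missing.
static_cf = {
    "row1": {"name": "Ann", "email": "ann@example.com", "phone": "555-0100"},
    "row2": {"name": "Bob", "email": "bob@example.com", "phone": "555-0101"},
    "row3": {"name": "Cy"},
}

# "Dynamic": the names need not coincide at all.
dynamic_cf = {
    "row1": {"2011-11-15T09:30": 101.5, "2011-11-15T09:31": 101.7},
    "row2": {"friend:bob": "", "friend:cy": ""},
}

static_overlap = shared_names(static_cf)
dynamic_overlap = shared_names(dynamic_cf)
```

Both examples have exactly the same shape; only the degree of overlap in the names differs.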

It is a plane after all

You'll forgive me for concluding that a lot of careless circular crap has been written about the Cassandra data model and if the authors had been responsible for any part of pure mathematics we'd all be well and truly screwed. Still, reading this documentation (which may or may not conflict with the "definitions" provided elsewhere) bolsters the belief (and I have to call it that) that a column family comprises all the green cells in the picture. It is simply a collection of cells whose repetition of names implies a layout (in either direction). Thus:

A column family resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp.

Now even if we pander to the dumb-ass terminology "column" one has to read this very consciously and recognize that "column" and "row" are very, very different. One is a single cell and one is a map. Repeat after me. A column is a single cell. A row is a map which makes explicit the implied structure in the picture where the columns (cells) have been laid out in a plane and the names used to achieve alignment. See how I avoided the common use of "row" there to describe the rows - doh - I mean horizontal sections of green things.
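To make the "laid out in a plane" remark concrete, here is a small sketch of how repeated names can align the cells. It assumes a column family is a map from row key to a map of names to values, with timestamps dropped; the function to_plane and the sample rows are hypothetical.

```python
def to_plane(column_family):
    """Lay the cells out in a plane, using names to achieve alignment.

    Returns (names, grid): names is the sorted union of all names used,
    and grid has one list per row key (in sorted key order) holding the
    value for each name, or None where that row has no such cell.
    """
    names = sorted({name for row in column_family.values() for name in row})
    grid = [
        [row.get(name) for name in names]
        for _, row in sorted(column_family.items(), key=lambda kv: kv[0])
    ]
    return names, grid


# Hypothetical column family: each row is a map, each entry a cell.
cf = {
    "row2": {"email": "bob@example.com", "name": "Bob"},
    "row1": {"name": "Ann", "email": "ann@example.com"},
    "row3": {"name": "Cy"},
}

names, grid = to_plane(cf)
```

The plane is something we impose after the fact: each row knows only its own names, and the alignment comes entirely from names that happen to repeat.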

You see it all makes complete sense:

A column is the most basic unit of representation in the Cassandra data model. A column is a triplet of a name (sometimes referred to as a "key"), a value, and a timestamp.

AARRRGGGHHHHH! For the love of god could somebody out there please stop pretending we are all "wrestling" with the data model concept (and not use phrases like "think of it as ...") and just define the fucking thing with remotely sensible terminology?

Scala For Quants

Tuesday, November 15, 2011
