Scala For Quants: November 2011

Someone once said that there are no difficult concepts, just bad explanations. This appears to apply to the Cassandra data model and for this reason there are plenty of explanations out there including "WTF is a supercolumn" by Arin Sarkissian. They have all added to my confusion. Some of them even give the impression that a supercolumn is some kind of super-concept and that doesn't appear to be the case. Indeed we should probably modify the adage. There are no difficult concepts, just really horrible naming conventions.

Allegedly Conceptual Prerequisites

1. A cell: (name, value, timestamp)

In my new naming convention a cell is a three tuple where the value is something we care about, the name is thought of as an index (row or column if you like) and the timestamp is something Cassandra cares about more than you typically would (synchonicity issues, et cetera). Now a three-tuple comprising name, value and timestamp is a straightforward thing unless of course you call it a column, which is the official terminology. Why would you call a single instance of a three tuple a column instead of a cell? I don't really know. Please forget I said "column", and try not to think about a column of soldiers

2. A list of cells {(name, value, timestamp), (name, value, timestamp)}

Self explanatory, unless you call it a column family. Since timestamps fade into the background you can think of it as a map from name to value.

3. A named list of cells (name, <list>)

Self explanatory, unless you call it a super-column.

Official Cassandra Terminology (Warning: Very Dumb)

Well that was easy. Unfortunately you might need to communicate with Cassandra, Cassandra documentation, or Cassandra fans so we are not even half way there. So I'll do my best to describe the essential terminology even though, if the glossary is anything to go by, it isn't entirely clear this has been agreed upon.

Definition: Column

Naively reading the documentation you would be forgiven for thinking that a column is a lone, solitary three tuple (name, value, timestamp) because that is what the documentation says and even more remarkably, that is what a column is. A column is a cell. This terminology might be logical, I suppose, in the sense that it specifies a column but actually that is not the intent or nor does a single cell comprise a column in the usual sense. Specification versus content might be my confusion and the documentation flicks effortlessly between data representation and data content as we continue to column families, and so forth. But I think we end up getting the joke. A column is a solitary three tuple just like a solitary soldier comprising a column.

Definition: Column family {(name, value, timestamp), (name, value, timestamp), ... }

A column family is what I referred to above as a list of cells, thus defining a list of what might be interpreted as column names (following the previous lack-of-definition to the word). But NO, that isn't the intent ... read on.

Definition: SuperColumn {(name, <list (i.e. map)>)}

A super column is like a column (i.e., a lone tuple, let's get that straight) but its value is actually a list of cells. Oh and it doens't have a time stamp so describing it as "like" a column seems a bit daft. There is no recursion to speak of as it turns out, despite this taunt.

It seems that a SuperColumn could be used to represent something like a column in a relational database table (a named vector of values), especially if you don't care for the name's in the map (they might be left blank). On the other hand, it is more like a row in a database in most uses of maps of maps (see below) making the terminology even dumber. Confused? Read on.

Definition: Column family

Next we learn that columns are "organized" into a column family comprising an ordered list of columns. If you are thinking that they might therefore represent how a table might be defined, don't. Perhaps though, if we were to represent something analogous to the contents of a table in a database we would need a super column family, right? Hmm, not really. But let's proceed.

Definition: SuperColumn family

Now column families, we are assured, can be either standard or "super" - just like petrol in Australia. So what is a super column family and is it analogous to the contents of a database table (in some special cases)? We may never know because a super column family is conspicuously absent from the official glossary and rumor is it is deprecated. Turning to wikipedia we confirm our suspicion that a super column family is a ... wait for it ... a tuple that consists of a key-value pair where the key is mapped to a value that are column families. That seems to be struggling towards grammatical and logical sense, but I still think that Cassandra is all about inferring the design, not just reading about it in the documentation - which would be boring.

De-mystifying Cassandra Terminology?

Now maybe that is a bit unkind and this description of column families (collections of cells) is a little more illuminating. It is clear what the design is, though not whether anyone can describe it carefully using official terminology to the letter (because that confuses the data structure with the data contents, columns with rows, et cetera). No wonder this is considered an advanced topic even for veteran database administrators. Or perhaps it is an advanced topic only for veteran database administrators and only if you use terminology attempting sloppy adherence to database terminology. Or maybe it is a really advanced topic if you deliberately use database terminology but confuse columns with cells?

Example: A static column family

Remember what a column family is? I've forgotten due to the apparent ambiguity between singular and plural infecting everyone who has written about Cassandra. But the use case intent is seemingly self-evident. Here for example the first "row" (whatever that is) has the same names as the second, although some are missing in the third. The layout is clear, unless you try to map it back to the definitions of column family and super column family that make no f@#ing sense whatsoever. And where exactly did the term "row" get defined, you ask? Well a row is a sorted map that matches column names to column names, according to the ever helpful glossary. It has a name (a row key) and a list of columns (i.e. cells) so it would seem to be ... yes ... a SuperColumn!).

Now hang on a second. Let's read that more carefully.

In a Column Family, a Row is a sorted map that matches column names to column values.

I'm inferring now that a single column family comprises all the green stuff in this picture (and not just one column, or row, say - using the usual meaning of "row" and "column") so that a row can be "in" a Column Family. Is this the correct usage of the terminology? I really don't know, and I'm pretty sure the following won't help us:

In a Super Column, a Row is a sorted map that matches Super Column names to maps matching Column names to Column values. Just to refresh your memory, a super column is a named list of cells whereas a column family is an unnamed list of cells. Does that help?

Example: A dynamic column family

Terminology aside we have the trivial generalization of the data structure to the case where the names need not coincide at all. Again, the intent is simple but what am I looking at? The adjective "dynamic" is applied to the collection, presumably, which implies that there is only one column family in the picture, but the "collection" we see is definitely not a single column family because a column family comprises a single list.

It is a plane after all

You'll forgive me for concluding that a lot of careless circular crap has been written about the Cassandra data model and if the authors had been responsible for any part of pure mathematics we'd all be well and truly screwed. Still, reading this documentation (which may or may not conflict with the "definitions" provided elsewhere) bolsters the belief (and I have to call it that) that a column family comprises all the green cells in the picture. It is simply a collection of cells containing whose repetition of names implies a layout (in either direction). Thus:

A column family resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp.

Now even if we pander to the dumb-ass terminology "column" one has to read this very consciously and recognize that "column" and "row" are very, very different. One is a single cell and one is a map. Repeat after me. A column is a single cell. A row is a map which makes explicit the implied structure in the picture where the columns (cells) have been laid out in a plane and the names used to achieve allignment. See how I avoided the common use of "row" there to describe the rows - doh - I mean horizontal sections of green things.

You see it all makes complete sense:

A column is the most basic unit of representation in the Cassandra data model. A column is a triplet of a name (sometimes referred to as a "key"), a value, and a timestamp.

AARRRGGGHHHHH! For the love of god could somebody out there please stop pretending we are all "wrestling" with the data model concept (and not use phrases like "think of it as ...") and just define the fucking thing with remotely sensibly terminology?

Welcome to Scala for Quants, the sporadic rants of a newbie. As for the language, Scala By Example is Odersky's nice introduction. The Scala Tour is a good primer on traits amongst other things which will get OO people off and running if you don't care for the functional aspects. (Traits are the major, or minor, departure from Java and C++ depending on how you look at it - justified here in the context of the long running inheritance debate).

Scala is pretty easy to read for the most part but what tripped me up at first was the non-alphanumeric syntax. My expanding cheat sheet:

1. On some uses of =>

()=>B, =>B define functions with no arguments equivalent to Function0[B]
x: =>A denotes a call-by-name parameter when in a function (see pass-by-name), or simply a declaration of a variable x returning A that must be called without parentheses. It is the same thing.
case 0=>, case i:Int =>, case Foo(a,b)=>, case Foo(a,b) if a==b => appear in match statements including those pattern matching case classes.
import java.util{Date=>UDate} imports and renames java.util.Date.

2. On some uses of _

"I don't really have a good conception of what wildcards are" - Martin Odersky (interview)

import.java_util{Date=>_} will exclude a symbol from import
case _ is the default in a match statement (c.f. "otherwise")
Characters +,-,! and ~ can be used as prefix operators in any class by defining unary_+ etc
var y:Int = _ is a default initial value. Without the wildcard you will be declaring an abstract variable (see abstract types in the Scala Tour)
case <body>{b @ _*}</body> => is an example of matching multiple elements in an case statement embedding XML (see Jim McBeath's primer)
_3 can be used to access the 3'rd element of a tuple, for example x._1
Existential types. A wildcard in a type binds to the closest enclosing type application (see scala-lang) so that List[_] is equivalent, for example, to List[T] forSome {type T }.

3. On some uses of :, ::, :::

In declarations. var a: Int = 7, val n = 5, val n:Int define mutable, immutable and abstract variables respectively.
These also may be used in match statements, as in case: i:Int, and function declarations, as in def(i:Int)
x: =>A denotes a call-by-name parameter when in a function (see pass-by-name), or simply a declaration of a variable x returning A that must be called without parentheses. It is the same thing.
1::List(2,3) is the cons operator that prepends and element to a list, as explained in scala list examples
oneList:::twoList is list concatenation

4. On some uses of <-, <:, <%, :>

for (n <- 0 to 6; e = n%2; if e==0 ) yield n*n
for ((a,b) <- list) yield a+b
for (x <- 0 to 4; y <- 0 until 3) yield (x,y)
class Set[A <: Ordered[A]] is an upper bound
trait Set[ A <% Ordered[A]] is a (weaker) view bound indicating that A must be convertible by an implicit conversion

5. On some uses of %

T<%U means type U is a view bound for T, as above.

6. On use of /:

(0/:list)(_+_) is foldLeft

7. On use of *

def printf(format:String, args:Any*): String is an example of the use of Vararg parameters

8. On some uses of +

class Stack[+A] { indicates that subtyping is covariant in that parameter

9. On some uses of #::, #:::

fibs: Stream[BigInt] = BigInt(1) #:: BigInt(2) is an alternate way to build streams (see Stream docs) consWrapper is an implicit def adding #:: for cons and #::: for concat

References & further reading

In this reminder of the syntax I'm borrowing heavily from this already terse syntax primer by Jim McBeath. As Jim notes, symbol names can use :, \, *, +, ~ and pretty much whatever you want, but the precedence is defined by the first character. See also the Scala cheat sheet and Scala API.

Scala For Quants

Tuesday, November 15, 2011

Its a Row, Its a Plane, Its a SuperColumn! The excellent Cassandra data model and its not so excellent terminology

Tuesday, November 8, 2011

Reading scala's non-alphanumeric syntax