Scala For Quants: 2011

Wednesday, December 7, 2011

Are Monads really Monads? Maybe(Some(R))

This article is written for people who, like myself, have stumbled upon monads in functional programming and use them all the time, but wondered why their definition isn't entirely obvious from the definition of a monad in mathematics, specifically category theory. I personally made rather slow progress on this front but some day will understand why arrows are strong monads ... perhaps. At that future moment diagrams like this will make complete sense upon first inspection.

We're operating at a much lower level here, but I hope I can make plain something of the connection between the monad pattern in programming (which looks kinda simple) and the monad definition in mathematics (which really doesn't). Newbies like your author have leap to the conclusion that a monad as an endofuntorial elephant - but please don't do the same because there is enough confusion out there already.

For example, take twenty random blog articles about monads and try to find one that states which category or categories we are in. It is like finding descriptions of "Germany" that include Nazi's, white sausages and so forth and yet never mention that Germany is a country. You kinda sorta get it but at some point ask "hang on a second, is Germany an ethnic group or a lifestyle?" That's what pissed me off just enough to write this, the world's 3248'th blog article about monads.

So let me state this first: a monad (programming speak) is a monad (math speak) in the category of types. If you get nothing more from this article so be it, though if you race to the definition of a monad on Wikipedia you might start wondering why a monad is Haskell isn't actually a functor (in the hierarchy). You might also wonder, quite reasonably, if monads in programming can't be other things too. Heunen and Jacobs explain why both monads and arrows (programming speak) are monoids (math speak).

Introducing MAP, a built-in functor

Back on planet Earth one use of a monad pattern emerges from an everyday programming need: the desire to extend the number system (or some other type) to include an exception, or 'missing' value. This is achieved by creating a new type Option[something]. When you alter a Double so that it is an Option[Double] you are creating an obvious mapping from one type to another, you might even say "the" obvious mapping. But you are not merely mapping objects to Option[objects] - at least not if you wish to get any work done. You are also dragging along all the nice things you'd like to do to the objects, namely hit them with functions. Your dragging is a functor.

A functor is not just a map from one space to another, we recall, but also a map of all the stuff-you-can-do-to-things in that space to stuff-you-can-do-to-things in the other space. To be more precise I should say we are dragging over not just functions but maybe some other stuff like relations, but that is a fine point for now. In the case of Double => Option[Double] you drag f(x) = x*x+3, say, into a function we might call Some(f) which should act on elements of Option[Double] and take Some(x) to Some(x*x+3).

Or take Double and shove it in Mathematica, metaphorically. It is no longer a Double, but merely an expression, like x*x+3 which is yet to be bound to a specific Double. Thing is, you'd like to do to the expressions what you do to real numbers, so evidently we are dragging morphisms like the one taking 3 into 9 and 5 into 15 over to morphisms on expressions (like the one taking x into 3*x). They ain't the same thing but there is clearly some conservation of structure as we go from real, tangible Double into ephemeral 'x'.

For more examples of "obvious" mappings take a bunch of program fragments and shove them together in a combinator pattern. Or play with state machines, whose composition can be simplified to a single state machine. The notion of an incomplete calculation, an unconsumated something, an unsimplified something, or an extension of a domain are all examples where there is a obvious, rich mapping from one type to another. So rich, in fact, that if we are thinking sloppily we often fail to distinguish between the domain and range of the obvious mapping.

Now let's back up and dot a few t's because while a monad is a fairly simple pattern in programming we aren't close to recognizing its mathematical definition (well, I'm not). To be a little formal we need categories, functors and natural transformations. Remember that a category is supposed to be a space together with some morphisms obeying laws? It was a long time ago for me too, but the morphisms in two of the examples I mentioned are merely the set of all functions with the signature Double => Double. Again, a morphism is a more general concept than a function, basically obeying associativity, and for that reason is often called an arrow (in mathematics). But let's not worry about the possible deficiency in our morphism collection right now because the set of all Double => Double functions is interesting enough.

Two quick checks. First, it is obvious that composition of functions in the mathematical sense - no side effects - is associative. Second, we'll need not concern ourselves overly with the existence of an identity morphism - the other category requirement - because constructing the function that takes x and returns x presents little challenge to most programmers! Safe to assume the identify exists.

So jolly good, DOUBLE is a category and, returning to our first example, Some( ) sure looks like it might help make a functor from DOUBLE into something we imaginatively call SOME-DOUBLE. (In general we could call the first category THING and the new category SOME-THING, I suppose). Likewise Symbolic( ), which I just made up, looks like it might help create a functor from DOUBLE into SYMBOLIC-DOUBLE, as we might name it, a category representing a computation but not its answer. Are the images of these maps themselves categories? Its seems pretty obvious that they are, because of the triviality of function application associativity and, to be pedantic, the obvious fact that the identity morphism maps to the identity morphisms in both SOME-DOUBLE and SYMBOLIC-DOUBLE.

We focus on the first functor momentarily, whose implementation in Scala comprises the constructor x=>Some(x) and the lifting f => (x => x map f) which uses the 'built-in' map method. I personally think of this built-in functor as a 2-tuple where MAP._1 acts on objects and MAP._2 acts on functions. I'll just use MAP going forward for either though, slightly abusing notation.

MAP = ( x=>Some(x), f => (x => x map f) )

As an aside, if I had my way Function1 would come with a map method so the morphism mapper f => (x => x map f) would look simpler, but it doesn't matter so long as we remember that morally speaking, the morphism-mapping part of the functor is just 'map'. When I say that map is 'built in' I really mean built in to the collections libraries, incidentally, because map is provided in the scala.collection.Traversible and scala.collection.Iterable traits and all collections are of one of these two types. Thus for all intents and purposes the combination (constructor, map) is an "off the shelf" functor although yes, for the third time, there are indeed plenty of morphisms other than functions - partial orders, paths on graphs et cetera - I'll let it go if you will.

What is the domain of MAP? My guess is that we should extend it to include not just DOUBLE but SOME-DOUBLE. Also SOME-SOME-DOUBLE, SOME-SOME-SOME-DOUBLE and so on ad infinitum. Call that big union the category Omega so that MAP is, in math speak, an endofunctor because it maps the category Omega into itself. We are getting a little closer to the mathematical definition now, I do hope, and here comes the last little jog.

A natural (and obvious) transformation between the built-in functor MAP applied once and the built-in functor MAP applied twice

To humor me apply MAP twice to get MAPMAP = MAP compose MAP, a functor that maps points x => Some(Some(x)) and of course messes with morphisms as well. For example, consider the function f(x) = x*x+3. MAP takes this morphism into another morphism which in turn happens to map Some(x) into Some(Some(x*x+3)). On the other hand MAPMAP would take f into a different morphism taking Some(x) into Some(Some(Some(x*x+3))). This just emphasizes the fact that MAP and MAPMAP are not the same thing, but they are dreadfully close.

The built-in functor MAP takes points to points and functions to functions

In fact with apologies for my handwriting and the mixing of mathematical and Scala notation on this diagram, MAP and MAPMAP are so similar that when applied to a particular morphism f(x)=x*x+2 they result in very similar looking functions g(x) and h(x). In fact g and h "are" the same Scala function - the very same pattern matching snippet of code I have scrawled in the picture! Needless to say the images of points a and b look quite similar as well. It is awfully tempting to relate MAP and MAPMAP by squishing Some(Some(x)) back to Some(x). The nice thing is, it costs us little to be precise.

In category theory there is a notion of a natural transformation between two functors that respects composition of morphisms. If we were category theorists we'd probably say "oh I suppose there must be a natural transformation from MAPMAP to MAP". I did not, come to think of it, have that exact experience myself, admittedly, because natural transformations are natural in some respects but also an abstract head f@#k the first time you come across them (they relate two functors, each of which in turn map functions to functions). The moral of the story: I think we should talk more about natural transformations to make us better people.

Actually we'll only talk about the one natural transformation mentioned: MAPMAP to MAP. As the definition on wikipedia explains (with F=MAPMAP and G=MAP, and C=D=Omega), this amounts to finding what is known as a component function (math speak) that lifts h=MAP(MAP(f)) onto g=MAP(f). That is straighforward because, translating back to Scala speak,

h andThen flatten = flatten andThen g

You couldn't hope for a more "obvious" commutative diagram that this - which is a good sign because in category theory many things are supposed look obvious after untangling of definitions or chasing of diagrams. We only need the top half of the diagram here, incidentally, and I don't think I need to define flatten. Even Scala newbies could implement the bit of it we need via pattern matching: x => x match {Some(y) => y}.

Flatten is the component realizing the natural transformation from MAP MAP to MAP (top half)
whereas Some is the component realizing the natural transformation from Id to MAP (bottom half)

Thus natural transformations are sometimes quite friendly, it would seem, and there is an even more trivial one lurking here in the bottom half of the diagram, namely the natural transformation that takes the identity functor into MAP. That is realized by a component that happens to translate into a constructor (programming speak, of course) x => Some(x). Evidently:

f andThen Some = Some andThen g

so the bottom half commutes.

Finally, the connection to mathematical monads

We're done. If you skip down the wikipedia monad page to the formal definition, ignoring the stuff at the start about adjoint functors which is just confusing, you'll see that a monad comprises a functor F and two natural transformations, one taking the identity functor Id to F and the other taking F compose F to F. The natural transformations are defined by their components which are, respectively, the constructor Some and the squishing operation flatten. To summarize, here is how we would relate the definition of a mathematical monad on Wikipedia to the day-to-day notion of a monad in programming Scala

Mathematical monad ingredients	Example of a common monad
Two categories C,D	One category comprising a an infinite union of similar types C=D=Omega=Union(DOUBLE,SOME-DOUBLE,SOME-SOME-DOUBLE,...)
A functor between the categories F: C->D	The "built-in" functor MAP taking points x=>Some(x) and functions f => (x => x map f)
A natural transformation taking the identify functor Id to F	A natural transformation taking the identify functor Id to MAP
A "component" realizing said natural transformation	A constructor x => Some(x) that "wraps" up x
A second natural transformation taking F F -> F	A natural transformation from MAP MAP to MAP
A second "component" realizing the second natural transformation	A flatten operation x => x match {Some(y)=>y} that unwraps y

As the mathematicians remind us there are some things to check, known as coherence conditions, but I leave that to the reader.

                                                   What's the point?

You may ask, why are we transforming functors when what we are looking for is as elementary as unwrapping a burrito? One might argue that understanding the abstraction is no harder than noticing the essential similarity between any two cases where this patterns arises (I'm not entirely sure I agree) and any good programmer should be on the lookout for that sort of thing. They should also be looking to communicate it to those maintaining code long after they have written it, and make that maintenance as simple as possible. Abstraction is next to cleanliness, as they say.

If one is going to talk loosely about "conserving structure" one might as well talk rigorously about it if it isn't all that much harder. And when you do, there are certainly some nontrivial things that fall out of category theory (for another time). No wonder the seemingly arcane area, once considered so obscure and abstract as to be useless, has made a charge up the Google n-gram leaderboard in recent times.

Popularity of "category theory" and "functional programming"

Still, I think most people would prefer to keep it a little wooly and merely recognize that we have an obvious identification between a doubly constructed type like List(List(3,4)) and one level down List(3,4), and that this identification needs to be carried over to morphisms - sorry functions. That is why you would naturally write flatmap yourself as the need arose, but might not necessarily appreciate the similarity to ComputeThis(ComputeThis(blah )) or ParseMe(ParseMe( blah )) or AbstractMe(AbstractMe( blah )) and so on. Actually Option is kind of like a partial computation too. You can think of Some(x) and Some(Some(x)) as calculations in stasis, if you will, although the only thing that "remains" to be calculated is whether x exists or not. My goodness, it always does!

One might wax mathematical in the hope of achieving a higher state of monadicness. Perhaps, for example, we "understand'' what MAP is doing because we root our intuition to pattern matching, kind of. If I tell you that g = (Some(x)=>Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted up into Option[Double]. And if I tell you that h = (Some(Some(x))=>Some(Some(x*x+3)) you say "oh I get it, that is like x=>x*x+3 lifted all the way to Option[Option[Double]]". And you won't complain if I point out that g and h are the same, at least on some domain, because they are rooted in the same thing, the naked function x=>x*x+3. Perhaps the obviousness prevents us from seeing the top half of the second diagram independently of the bottom, however.

Next time: what's the built-in flatMap got to do with it?

It may seem odd that we have discussed monads in both programming and mathematics and I made only passing reference to flatMap, the built-in "bind" for Scala collections (defined here and here for Traversable and Iterable traits respectively). I leave that connection for next time.

Tuesday, November 15, 2011

Its a Row, Its a Plane, Its a SuperColumn! The excellent Cassandra data model and its not so excellent terminology

Someone once said that there are no difficult concepts, just bad explanations. This appears to apply to the Cassandra data model and for this reason there are plenty of explanations out there including "WTF is a supercolumn" by Arin Sarkissian. They have all added to my confusion. Some of them even give the impression that a supercolumn is some kind of super-concept and that doesn't appear to be the case. Indeed we should probably modify the adage. There are no difficult concepts, just really horrible naming conventions.

Allegedly Conceptual Prerequisites

1. A cell: (name, value, timestamp)

In my new naming convention a cell is a three tuple where the value is something we care about, the name is thought of as an index (row or column if you like) and the timestamp is something Cassandra cares about more than you typically would (synchonicity issues, et cetera). Now a three-tuple comprising name, value and timestamp is a straightforward thing unless of course you call it a column, which is the official terminology. Why would you call a single instance of a three tuple a column instead of a cell? I don't really know. Please forget I said "column", and try not to think about a column of soldiers

2. A list of cells {(name, value, timestamp), (name, value, timestamp)}

Self explanatory, unless you call it a column family. Since timestamps fade into the background you can think of it as a map from name to value.

3. A named list of cells (name, <list>)

Self explanatory, unless you call it a super-column.

Official Cassandra Terminology (Warning: Very Dumb)

Well that was easy. Unfortunately you might need to communicate with Cassandra, Cassandra documentation, or Cassandra fans so we are not even half way there. So I'll do my best to describe the essential terminology even though, if the glossary is anything to go by, it isn't entirely clear this has been agreed upon.

Definition: Column

Naively reading the documentation you would be forgiven for thinking that a column is a lone, solitary three tuple (name, value, timestamp) because that is what the documentation says and even more remarkably, that is what a column is. A column is a cell. This terminology might be logical, I suppose, in the sense that it specifies a column but actually that is not the intent or nor does a single cell comprise a column in the usual sense. Specification versus content might be my confusion and the documentation flicks effortlessly between data representation and data content as we continue to column families, and so forth. But I think we end up getting the joke. A column is a solitary three tuple just like a solitary soldier comprising a column.

Definition: Column family {(name, value, timestamp), (name, value, timestamp), ... }

A column family is what I referred to above as a list of cells, thus defining a list of what might be interpreted as column names (following the previous lack-of-definition to the word). But NO, that isn't the intent ... read on.

Definition: SuperColumn {(name, <list (i.e. map)>)}

A super column is like a column (i.e., a lone tuple, let's get that straight) but its value is actually a list of cells. Oh and it doens't have a time stamp so describing it as "like" a column seems a bit daft. There is no recursion to speak of as it turns out, despite this taunt.

It seems that a SuperColumn could be used to represent something like a column in a relational database table (a named vector of values), especially if you don't care for the name's in the map (they might be left blank). On the other hand, it is more like a row in a database in most uses of maps of maps (see below) making the terminology even dumber. Confused? Read on.

Definition: Column family

Next we learn that columns are "organized" into a column family comprising an ordered list of columns. If you are thinking that they might therefore represent how a table might be defined, don't. Perhaps though, if we were to represent something analogous to the contents of a table in a database we would need a super column family, right? Hmm, not really. But let's proceed.

Definition: SuperColumn family

Now column families, we are assured, can be either standard or "super" - just like petrol in Australia. So what is a super column family and is it analogous to the contents of a database table (in some special cases)? We may never know because a super column family is conspicuously absent from the official glossary and rumor is it is deprecated. Turning to wikipedia we confirm our suspicion that a super column family is a ... wait for it ... a tuple that consists of a key-value pair where the key is mapped to a value that are column families. That seems to be struggling towards grammatical and logical sense, but I still think that Cassandra is all about inferring the design, not just reading about it in the documentation - which would be boring.

De-mystifying Cassandra Terminology?

Now maybe that is a bit unkind and this description of column families (collections of cells) is a little more illuminating. It is clear what the design is, though not whether anyone can describe it carefully using official terminology to the letter (because that confuses the data structure with the data contents, columns with rows, et cetera). No wonder this is considered an advanced topic even for veteran database administrators. Or perhaps it is an advanced topic only for veteran database administrators and only if you use terminology attempting sloppy adherence to database terminology. Or maybe it is a really advanced topic if you deliberately use database terminology but confuse columns with cells?

Example: A static column family

Remember what a column family is? I've forgotten due to the apparent ambiguity between singular and plural infecting everyone who has written about Cassandra. But the use case intent is seemingly self-evident. Here for example the first "row" (whatever that is) has the same names as the second, although some are missing in the third. The layout is clear, unless you try to map it back to the definitions of column family and super column family that make no f@#ing sense whatsoever. And where exactly did the term "row" get defined, you ask? Well a row is a sorted map that matches column names to column names, according to the ever helpful glossary. It has a name (a row key) and a list of columns (i.e. cells) so it would seem to be ... yes ... a SuperColumn!).

Now hang on a second. Let's read that more carefully.

In a Column Family, a Row is a sorted map that matches column names to column values.

I'm inferring now that a single column family comprises all the green stuff in this picture (and not just one column, or row, say - using the usual meaning of "row" and "column") so that a row can be "in" a Column Family. Is this the correct usage of the terminology? I really don't know, and I'm pretty sure the following won't help us:

In a Super Column, a Row is a sorted map that matches Super Column names to maps matching Column names to Column values. Just to refresh your memory, a super column is a named list of cells whereas a column family is an unnamed list of cells. Does that help?

Example: A dynamic column family

Terminology aside we have the trivial generalization of the data structure to the case where the names need not coincide at all. Again, the intent is simple but what am I looking at? The adjective "dynamic" is applied to the collection, presumably, which implies that there is only one column family in the picture, but the "collection" we see is definitely not a single column family because a column family comprises a single list.

It is a plane after all

You'll forgive me for concluding that a lot of careless circular crap has been written about the Cassandra data model and if the authors had been responsible for any part of pure mathematics we'd all be well and truly screwed. Still, reading this documentation (which may or may not conflict with the "definitions" provided elsewhere) bolsters the belief (and I have to call it that) that a column family comprises all the green cells in the picture. It is simply a collection of cells containing whose repetition of names implies a layout (in either direction). Thus:

A column family resembles a table in an RDBMS. Column families contain rows and columns. Each row is uniquely identified by a row key. Each row has multiple columns, each of which has a name, value, and a timestamp.

Now even if we pander to the dumb-ass terminology "column" one has to read this very consciously and recognize that "column" and "row" are very, very different. One is a single cell and one is a map. Repeat after me. A column is a single cell. A row is a map which makes explicit the implied structure in the picture where the columns (cells) have been laid out in a plane and the names used to achieve allignment. See how I avoided the common use of "row" there to describe the rows - doh - I mean horizontal sections of green things.

You see it all makes complete sense:

A column is the most basic unit of representation in the Cassandra data model. A column is a triplet of a name (sometimes referred to as a "key"), a value, and a timestamp.

AARRRGGGHHHHH! For the love of god could somebody out there please stop pretending we are all "wrestling" with the data model concept (and not use phrases like "think of it as ...") and just define the fucking thing with remotely sensibly terminology?

Tuesday, November 8, 2011

Reading scala's non-alphanumeric syntax

Welcome to Scala for Quants, the sporadic rants of a newbie. As for the language, Scala By Example is Odersky's nice introduction. The Scala Tour is a good primer on traits amongst other things which will get OO people off and running if you don't care for the functional aspects. (Traits are the major, or minor, departure from Java and C++ depending on how you look at it - justified here in the context of the long running inheritance debate).

Scala is pretty easy to read for the most part but what tripped me up at first was the non-alphanumeric syntax. My expanding cheat sheet:

1. On some uses of =>

()=>B, =>B define functions with no arguments equivalent to Function0[B]
x: =>A denotes a call-by-name parameter when in a function (see pass-by-name), or simply a declaration of a variable x returning A that must be called without parentheses. It is the same thing.
case 0=>, case i:Int =>, case Foo(a,b)=>, case Foo(a,b) if a==b => appear in match statements including those pattern matching case classes.
import java.util{Date=>UDate} imports and renames java.util.Date.

2. On some uses of _

"I don't really have a good conception of what wildcards are" - Martin Odersky (interview)

import.java_util{Date=>_} will exclude a symbol from import
case _ is the default in a match statement (c.f. "otherwise")
Characters +,-,! and ~ can be used as prefix operators in any class by defining unary_+ etc
var y:Int = _ is a default initial value. Without the wildcard you will be declaring an abstract variable (see abstract types in the Scala Tour)
case <body>{b @ _*}</body> => is an example of matching multiple elements in an case statement embedding XML (see Jim McBeath's primer)
_3 can be used to access the 3'rd element of a tuple, for example x._1
Existential types. A wildcard in a type binds to the closest enclosing type application (see scala-lang) so that List[_] is equivalent, for example, to List[T] forSome {type T }.

3. On some uses of :, ::, :::

In declarations. var a: Int = 7, val n = 5, val n:Int define mutable, immutable and abstract variables respectively.
These also may be used in match statements, as in case: i:Int, and function declarations, as in def(i:Int)
x: =>A denotes a call-by-name parameter when in a function (see pass-by-name), or simply a declaration of a variable x returning A that must be called without parentheses. It is the same thing.
1::List(2,3) is the cons operator that prepends and element to a list, as explained in scala list examples
oneList:::twoList is list concatenation

4. On some uses of <-, <:, <%, :>

for (n <- 0 to 6; e = n%2; if e==0 ) yield n*n
for ((a,b) <- list) yield a+b
for (x <- 0 to 4; y <- 0 until 3) yield (x,y)
class Set[A <: Ordered[A]] is an upper bound
trait Set[ A <% Ordered[A]] is a (weaker) view bound indicating that A must be convertible by an implicit conversion

5. On some uses of %

T<%U means type U is a view bound for T, as above.

6. On use of /:

(0/:list)(_+_) is foldLeft

7. On use of *

def printf(format:String, args:Any*): String is an example of the use of Vararg parameters

8. On some uses of +

class Stack[+A] { indicates that subtyping is covariant in that parameter

9. On some uses of #::, #:::

fibs: Stream[BigInt] = BigInt(1) #:: BigInt(2) is an alternate way to build streams (see Stream docs) consWrapper is an implicit def adding #:: for cons and #::: for concat

References & further reading

In this reminder of the syntax I'm borrowing heavily from this already terse syntax primer by Jim McBeath. As Jim notes, symbol names can use :, \, *, +, ~ and pretty much whatever you want, but the precedence is defined by the first character. See also the Scala cheat sheet and Scala API.