пятница, 9 мая 2014 г.

How to Make an SQL from Clojure or My Macros Kata

Despite the fact that I started learning Clojure about 2 years ago, until recently I didn’t explore, maybe, the most fascinating part of the language – macros. I did try to compose simple macros once or twice, but generally they failed to grasp my attention and I never understood how they work and, more importantly, when should I use them. To fix this, last month I decided to implement something interesting with the help of macros and here I will show what I did and sum up what I learnt.

The example that I’ve taken might be not the most suitable for the goal, but it worked out quite well. The idea was to allow for writing SQL-like queries against the lists of Clojure dictionaries or records – basically anything that allows key-based access to elements. The value of such a thing is doubtful, but SQL is a well-known Domain Specific Language and the fact that macros are considered a great tool for creating new languages in Clojure made me chose it as an exercise. Besides, I remember that there was an exercise like this – similarly with SQL-like queries – in the Joy of Clojure book, which made me think that this task suits my learning goals well. However, I intentionally avoided peeking into the book to not spoil my exercise and, kind of, do everything myself without borrowing others’ ideas and solutions.

The results at which I arrived are quite fascinating: I am now able to write almost real SQL against lists of records. The key difference between my queries and SQL is the use of keywords as the names of fields to select and filter on. In addition to that, well, my selects come in parenthesis, because I am still writing LISP. Otherwise, everything looks very similar to the queries one would execute against a database and that feels pretty awesome. Now, let me share my excitement and show you the code.

In the first article we will begin with simply selecting some fields from the records and add where clause for filtering. In the second one I will show how we can enable joining tables together and ordering the results. So, what we want to achieve at the first step is to be able to write a query like this:

(def table [
  {:a 1 :b 100 :c "100" :d 4}
  {:a 2 :b 200 :c "200" :d 3}
  {:a 3 :b 300 :c "300" :d 2}
  {:a 4 :b 400 :c "400" :d 1}])

(select :a :b from table)

This is not a very difficult task, because all that we have to do is limit the set of key-value pairs in each dictionary in the resulting list to only the keys, which are specified in the query. On the other hand, we have this from symbol to treat, but that’s quite easy as well.

(defmacro select [& what]
  (let [fields (set 
          (take-while #(not= 'from %) what))
      source (fnext 
          (drop-while #(not= 'from %) what))]
      `(map (fn [record#]
            (into {} (filter #(~fields (first %)) record#)))
          ~source)))

It turns out that similarly to the functions I can’t easily write a macro, which solves precisely one problem, so several things are going on in select. First, we have to get hold of a set of keys denoting fields to select. All of them come before the from word and for now we can assume that everything that goes before it is a field name. This said, we just take elements of the argument list until we meet from, push them into a set and bind it to the fields symbol. In a very similar fashion we extract the table, from which we want to select records – I expect that it is found right next to from, so I skip everything until from as well as from itself and grab the first element of the remaining sequence. Having fields and source, we can actually define what our macro will return to the caller. We simply map a function that filters each record according to the presence of the key in the fields set and composes a hashmap from the resulting key-value pairs  onto the “table”. Although this doesn’t feel like the most efficient and nice solution, I’m okay with it for now.

Because we are writing a macro we want to return something that will be evaluated upon execution of the program – that’s why we syntax-quote our map call with the `. Syntax-quoting prevents the form from being evaluated, but unlike simple quote (‘) it allows us to avoid unquoting map, fn and all the other stuff that is used inside it. One of the problems with quoting surfaces when we want to use a binding inside the body of a macro – in our case with fn. The idea is that binding requires a symbol and we need to introduce it somehow. There are several alternatives, but the simplest one is to use the gensym thing through suffixing a symbol with # character. When encountered with it inside a syntax-quoted form, Clojure will generate a non-qualified symbol, which we can then use. Finally, in select we need to unquote fields and source so that proper stuff will be inserted in the respective places. If you forget to unquote these symbols, syntax reader will try to resolve them to fully-qualified ones. This may result either in a failure in case you don’t have such a symbol outside the macro, or in a subtle failure – if you have one. The latter situation might lead to great confusion: the code that used to work will at some moment get broken because its surroundings have changed. Fortunately, in most cases such an error will surface quite soon if not instantly. Having done all these things right, we can check whether they work and, fortunately, they do:

(select :a :b from table)
;=> ({:a 1, :b 100} {:a 2, :b 200} {:a 3, :b 300} {:a 4, :b 400})

Let us try to understand what did macro allow us to do here that a function wouldn’t permit. Most notable thing is our use of the from symbol. Despite the fact that it is not bound to anything we can still utilize it in the macro – as a separator. A function wouldn’t allow that because it would try to evaluate all the arguments passed into it and fail with from. Macros, on the other side, receive their arguments in unevaluated form, which allows us to play with them in different and strange ways. Note that we could achieve the same behavior with a function if we used, for example, a keyword :from instead of the pure symbol, so there is actually little value in using macros like this. However, knowing that Clojure allows us to write down some utterance and can live with it is pleasant and inspiring.

Now, when we can select stuff, let us introduce filtering. This task is less trivial not only due to more complex constructs needed for conditions, but also because they will come in infix notation, which is natural for SQL, but hardly a common thing in LISPy Clojure. Even though this may sound difficult, it boils down mainly to tossing the list of arguments and just requires covering a couple of cases and careful use of recursion.

To make filtering possible we will define a macro that creates a Clojure-evaluatable condition from an SQL-like one. First, we will try to cover the most simple case – something like select :a :b from table where :a < :d. Our new macro – let’s call it condition – will get the :a < :d part of the above query from the select and return something like (< (:a table) (:d table)) – the piece, which Clojure will easily consume. The first version is going to be ve-e-ery simple:

(defmacro condition [t k1 op k2]
  `(~op (~k1 ~t) (~k2 ~t)))

The only thing that we do here is changing infix notation into prefix one and accessing the values from a record with the keys specified in the condition. We can use macroexpand to check that condition easily treats simple cases:

(macroexpand '(condition table :a < :d))
;=> (< (:a table) (:d table))

(macroexpand '(condition table :b = :c))
;=> (= (:b table) (:c table))

The fact that the macro fits easily into one line makes me ask myself whether there is any reason to writing it. However, instead of trying to find an answer I will simply fix the cause and… make it longer. One thing that condition misses is the ability to consume literals. For example, it is pretty natural to write a filter like this: where :a > 2 and we definitely should support this. We will do it the simple way – if we face a keyword in a condition, we will use it to get the corresponding value from the record and inject it in the resulting code. Otherwise, we won’t do any transformations to the value and will include it into the result as is:

(defmacro condition [t k1 op k2]
  (let [v1 (if (keyword? k1) `(~k1 ~t) k1)
      v2 (if (keyword? k2) `(~k2 ~t) k2)]
  `(~op ~v1 ~v2)))

Now condition treats both keys, which denote fields of the records, and mere numbers or strings equally well allowing us to filter both based on relationships between fields and on their comparison with some external values. What makes me feel particularly great about the new version of the macro is that there are no checks for keywords in the code that it outputs – it determines whether it looks at a keyword in the let form, which is not a part of the resulting code. This makes the output of the macro shorter and a bit more efficient. To see it we can do macroexpand as usual:

(macroexpand '(condition table :d not= :a))
;=> (not= (:d table) (:a table))
; no checks for keyword, table accessed in both cases

(macroexpand '(condition table :a < 3))
;=> (< (:a table) 3)
; no checks for keyword, and table accessed only once
; while for 3 we have just 3

Our condition macro can’t handle neither complex nor long conditions so far. We will make it almost universal a little bit later, but now let’s take a step back and assemble condition and select together. When faced with this task first I did introduce another level of indirection in the form of additional helper function, which would wrap the condition. However, it turns out that one can do this with much less clutter. After some experiments, I ended up using a lambda expression in the body of select:

(defmacro select [& what]
  (let [fields (set 
          (take-while #(not= 'from %) what))
      source (fnext 
          (drop-while #(not= 'from %) what))
      conditions (next (drop-while #(not= 'where %) what))]
      `(map (fn [record#]
            (into {} (filter #(~fields (first %)) record#)))
          (filter #(condition % ~@conditions) ~source)))) 

In the snippet above we have added another binding to take hold of the conditions passed into select. We assume that they go after the where word and thus extract them from the arguments list similarly to the fields keys and the table. Then, in the body of the macro we introduce filtering, which is performed with a lambda making use of our condition macro and the list of conditions in infix notation. The most surprising part is that this thing actually works:

(select :a :b from table where :a < :d)
;=> ({:a 1, :b 100} {:a 2, :b 200})
(select :d :c from table where :d >= 3)
;=> ({:c "100", :d 4} {:c "200", :d 3})
(select :a :b :c :d from table where :a = :b)
;=> ()

That’s a wonderful achievement, we can write SQL queries in Clojure! Well, almost. For now, we miss a couple things including joins, ordering and more complex conditions. The latter seem to be the most natural thing to address next, so let us enable queries like (select :a :b from table where :a <= 2 or :c = “400”). At the current stage trying to execute something like this will lead us to a “Wrong number of arguments passed to the condition macro” error, which is a not so often case when the error message is actually helpful. We do not in fact allow conditions composed of more than three elements and that’s what we will fix right now.

The most trivial way to allow for the longer conditions like :a <= 2 or :c = “400” is to introduce another version of condition macro, which would take 8 arguments: [t k1 op1 k2 link k3 op2 k4]. Here k3 and k4 are either field keys or values (:c and “400” in the above example) similar to k1 and k2, op2 is a comparison operator (=) like the op1 and link is a Boolean function (or), which glues two parts together. Now, 8 arguments is quite too much even if we try to bring SQL into Clojure. If you don’t think so, it is not difficult to come up with a three-part condition which would require 12 arguments with even more stupid names. Fortunately, we don’t have to write all these things out because the similarity between k3 op2 k4 and k1 op1 k2 suggests that we can generalize the implementation to allow arbitrary long conditions with quite a short and concise piece of code:

(defmacro condition 
  ([t k1 op k2]
    (let [v1 (if (keyword? k1) `(~k1 ~t) k1)
        v2 (if (keyword? k2) `(~k2 ~t) k2)]
    `(~op ~v1 ~v2)))
  ([t k1 op1 k2 link & other]
    `(~link
      (condition ~t ~k1 ~op1 ~k2)
      (condition ~t ~@other))))

First of all, the version with 4 arguments remains intact meaning that we don’t break the existing code, which is great. Another good thing to note, is that we reuse it inside the version with the longer argument list – there is no need to write the checks for keywords and move arguments once more, given that we have written this code before. Moreover, this new version of condition might reuse itself in case the other part of arguments list includes more than three elements. This basically means that in addition to the original goal we just allowed writing conditions like :a = 1 or :a = 2 or :a = 3 or :a = 5 or :a = 6 or :a = 7 at no cost. Recursion is a great thing by itself, but it is even greater when combined with argument lists of arbitrary length. Especially so when it works:

(select :a :b from table where :a <= 2 or :c = “400”)
;=> ({:a 1, :b 100} {:a 2, :b 200} {:a 4, :b 400})

(select :a :b from table where :a = 1 or :a = 2 or :a = 4 or :a = 5 or :a = 6 or :a = 7)
;=> ({:a 1, :b 100} {:a 2, :b 200} {:a 4, :b 400})

This all looks incredible, but there is another crucial thing that we miss – conditions can be complex, which means that we might want to include prioritization with parenthesis into them. This could be a difficult and messy thing to implement should we write in something that is not LISP. However, with Clojure macros everything wrapped in parenthesis is a list and can be treated as a list. The incredible power behind this idea becomes clear once we see how little code we have to add to the existing condition macro to  dramatically increase allowed complexity of conditions:

(defmacro condition
  ([t complex]
    `(condition ~t ~@complex))
  ([t k1 op k2]
    (let [v1 (if (keyword? k1) `(~k1 ~t) k1)
        v2 (if (keyword? k2) `(~k2 ~t) k2)]
    `(~op ~v1 ~v2)))
  ([t k1 op1 k2 link & other]
    `(~link
      (condition ~t ~k1 ~op1 ~k2)
      (condition ~t ~@other))))

As you can see, the only new thing in the macro is another override accepting two arguments: the usual record t and a complex condition. Having taken the complex part it basically unwraps it and calls itself with the extracted elements. As a positive side effect, this little addition drops the optional parenthesis in case someone decides to surround the entire restrictions list with them.

(select :a :b from table where :a = 2 or (:d < 3 and :b > 300))
;=> ({:a 2, :b 200} {:a 4, :b 400})

(select :a :b from table where (:a <= 2 or :c = “400”))
;=> ({:a 1, :b 100} {:a 2, :b 200} {:a 4, :b 400})

Once again, recursion and pattern matching do a lot of hard work for us. However, even though this feels like magic, it isn’t and there is an input that we should definitely handle, but don’t treat well so far. That’s the first of the above examples but with reversed order of restrictions: (select :a :b from table where (:d < 3 and :b > 300) or :a = 2) yields an error. The problem is that we allow parenthesis only in the tail position of the conditions list. On the other side, should we do TDD, tests would tell us that we have already broken the select macro – it doesn’t work without filtering anymore, because no matter if there is the where word and something after it in the arguments list, we always try to construct a filter. Since condition macro doesn’t work with a single argument, record, it rightfully fails. Here is the final version of the macro, which addresses both issues:

(defmacro condition
  ([t]
    `true)
  ([t complex]
    `(condition ~t ~@complex))
  ([t k1 op k2]
    (if (seq? k1)
      `(~op
        (condition ~t ~k1)
        (condition ~t ~k2))
      (let [v1 (if (keyword? k1) `(~k1 ~t) k1)
          v2 (if (keyword? k2) `(~k2 ~t) k2)]
      `(~op ~v1 ~v2))))
  ([t k1 op1 k2 link & other]
    (if (seq? k1)
      `(~op1 
        (condition ~t ~k1)
        (condition ~t ~k2 ~link ~@other))
      `(~link
        (condition ~t ~k1 ~op1 ~k2)
        (condition ~t ~@other)))))

To handle queries without filtering we simply introduce another overload of condition for empty restriction list that returns quoted truth (this even sounds cool!). When we invoke it from select, true ends up in the body of the filter function making it allow every record that it sees. As for making parenthesis work in any position, to allow this we have to do a little bit more work. The idea is that when introducing prioritization we made our older assumptions about what is a comparable value (k1, k2, etc.), what is a comparison operator (op, op1 and op2) and what is a Boolean link between several restrictions fallible. However, to fix this we can just check whether we look at something enclosed in parenthesis and handle the thing accordingly if we do. That’s what the new seq? parts are for. Even though this makes the macro grow in size significantly, I believe that’s a fair price for almost universally working solution:

(select :a :b from table where (:d < 3 and :b > 300) or :a = 2)
;=> ({:a 2, :b 200} {:a 4, :b 400})

(select :c from table)
;=> ({:c "100"} {:c "200"} {:c "300"} {:c "400"})

Now, that we achieved an important milestone of selecting stuff from lists of records or dictionaries and filtering it freely, let me sum up what I learnt from the exercise:
  1. Macros really make one think in a different way than when composing mere functions. Even though it is hard to communicate this feeling, writing code that produces code is whole another sensation than writing code that produces data.
  2. On the other side, Clojure does its best to make code as similar to data as possible. Storing programs in lists is cool, because here and there this saves you from extra difficulties.
  3. Arguments passed to a Clojure function are evaluated first; those passed to a macro are not. This makes a huge difference.  
  4. You can’t apply a macro, but you can splice-unquote arguments for it. 
  5. In a macro you can have code that it will return – usually quoted – and code that determines what will be returned and does not show up in the macro’s result. 
  6. Doing katas is worth the effort even if you don’t feel like it helps you understand anything. If something feels too much for you just force yourself into it and logic will gradually surface. To some extent learning is like violence. 
  7. If you don’t refactor, you might be missing a chance to understand both your code and your tools.
  8. Explaining what you have done in a sum-up blog entry might feel stupid, but if you are not doing this you are maybe missing another chance to understand both your code and your tools.

For now I didn’t cover a couple of things, which one expects from SQL. Most important are joins and ordering – these I will implement in the next article. Besides, there are a lot of minor features like, for example, selecting all the fields from a table with a star (*). This particular one is extremely easy to introduce, so I won’t show it here – you can take a look at the code on GitHub if you want.

Note that the above GitHub repository differs in some ways from the code shown in this post. If you want the latter, just grab this single file – zero dependencies is great.

I have also discussed my affair with macros in the recent episode of our Code Speak Loop podcast with my co-host Michael Pankov.

Комментариев нет:

Отправить комментарий