Wednesday, February 10, 2010

Continuation Monad

Last month I worked on a TopCoder problem that led me to continue my exploration of monads. The problem is to sort a set of elements using comparisons whose cost is a function of the comparands. Solutions comprise a pair of methods, one of which initializes the program with the costs, and the other of which is invoked repeatedly to effect the dialog between the program and the less-than oracle.

Now choose your own adventure. Do you:

  1. Ask how a real computer scientist would solve it
  2. Watch me muddle through comparison-based sorting with variable costs
  3. Follow my continuation monad implementation in Java

Comparison-based sorting with variable costs

Sidebar: true story

I've seen a related problem in real life. Our app iDeal helped salespersons develop contract terms subject to Boolean constraints called guidelines, analogous to the Boolean trees examined in Charikar et al. We didn't actually have to buy our guideline results, but we did care about evaluation speed. There were basically two prices: low (computable on our local app server) and high (computable at a minicomputer reached by at least two SOAP hops and a batch job). Unlike the Charikar model, our cost function was not additive: it behaved like or, giving a result of "expensive" whenever at least one guideline couldn't be computed locally and "cheap" otherwise. Consequently, the real-life challenge was less about algorithm design and more about software engineering: given that guideline evaluation is demanded at many call sites, how can we avoid unnecessarily consulting the minicomputer?

I first thought of divide and conquer approaches to avoid redundant queries (don't ask a < c if it is already known that a < b and b < c). I thought about algorithms for the special case when C(a, b) = 1: mergesort, heapsort, and quicksort. All of these relate to binary trees, and I thought about properties a binary search tree constructed at optimal cost would have in the general case.

Shouldn't the root of the tree be the element whose summed comparison cost against all the other elements is minimal? Intuitively, we compare every element with the root to locate it in the tree, and we want to minimize the cost of inserting each element by this recursive process. Not quite: this fails to account for tree shape (consider p < q < r with costs C(p, X) = 1 for X in {q, r} and C(q, r) = 2). And thinking back to the special case of a constant cost function, it's clear that this root selection heuristic has no power there. It's also often mentioned in discussions of, e.g., quicksort that it's important to have a balanced tree. But we're not looking just to balance and fill the tree: by choosing the cost function carefully, it's easy to force an optimal tree in which every left child is a leaf node.

The shape of the tree is determined both by the sequence of queries the program makes and by the order of elements determined by the oracle. The best choice for the root is the median of the elements, so an exact solution is O(n^2), but we could think about approximate selection and average-case performance.

What about a bottom-up approach? Do our trees have optimal substructure? First: in an optimal tree, is every subtree optimal? Suppose not. Then some optimal tree has a subtree that can be rearranged for lower cost. But this doesn't affect the cost to construct the rest of the tree, so after the rearrangement the overall cost is lowered. This contradicts the supposition that our original tree was optimal with a suboptimal subtree. But second: can we efficiently combine optimal subsolutions to derive optimal solutions?

A straightforward mergesort would guarantee a minimal number of comparisons, but not necessarily a minimal-cost set of comparisons. Although a minimal tree implies minimal subtrees, not all combinations of minimal subtrees constitute a (rearrangement of a) minimal tree. To see this, imagine a cost function that does not satisfy the triangle inequality. In this case, the cheapest way to combine two trees might involve using additional vertices that are not part of either original tree.

Let g(t) = sum(C(n, n') for n in t for n' in ~t). Recursively merge the runs with the lowest g.
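In code, g is just a double sum over a run and its complement. A minimal Java sketch (the CostFunction interface and the representation of runs as lists of element indices are illustrative assumptions, not code from my submission):

import java.util.List;

// Hypothetical helper: g(t) = sum of C(n, n') for n in t and n' in ~t.
final class GCost {
    interface CostFunction {
        int cost(int a, int b);
    }

    static long g(List<Integer> run, List<Integer> complement, CostFunction c) {
        long total = 0;
        for (int n : run) {
            for (int nPrime : complement) {
                total += c.cost(n, nPrime);
            }
        }
        return total;
    }
}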

Claim: g-ordered mergesort yields optimal solutions. Proof sketch:

Let A and B represent two C-optimal subsolutions whose g-costs are least among all subsolutions. Claim: the combination S of A and B is C-optimal. Suppose not. Then there is a (sub)solution S' containing all the elements of A and B such that C(S) > C(S'). S' is a proper superset of S, so we can obtain it by adding elements from the complement of S. These missing elements must exist in other optimal subsolutions having higher g-costs than that of S. Also, each of these elements must have edges to each of A and B in the optimal tree of S. Consider any subsolution G hosting one of these missing elements. Its g-cost is \sum_{g \in G}(\sum_{a \in A}(C(g, a)) + \sum_{b \in B}(C(g, b)) + \sum_{c' \in c}(C(g, c'))), where a, b, and c' are drawn from our input sets and the complement of their union, respectively. To simplify the notation, I'm going to switch now to implicit summations: the g-cost of subsolution G is c(g, a) + c(g, b) + c(g, c). The g-cost of S is c(a, c) + c(b, c) + c(s, c). The g-cost of A is c(a, b) + c(a, c). The g-cost of B is c(a, b) + c(b, c). With a little algebra and the observation that the complement of the union of A and B includes all the g \in G, we can derive a contradiction from the assumption that there exist elements outside the min-g-cost subsolutions A and B that would improve the C-cost of the combination of A and B:


c(s, a) + c(s, b) + c(s, c) > max(c(a, c) + c(a, b); c(b, c) + c(a, b))
                            > max(c(S) - c(s, c) - c(b, c) + c(a, b);
                                  c(S) - c(s, c) - c(a, c) + c(a, b))
c(s, a) + c(s, b) + 2c(s, c) - c(a, b) - c(S) > max(-c(b, c); -c(a, c))
    (expanding c(S) = c(s, c) + c(a, c) + c(b, c))
c(s, a) + c(s, b) + c(s, c) - c(a, b) - c(a, c) - c(b, c) > -max(c(b, c); c(a, c))
c(s, a) + c(s, b) + c(s, c) - c(a, b) - c(a, c) > 0
c(s, a) + c(s, b) + c(s, c) - c(a, b) - c(b, c) > 0
c(s, a) + c(s, b) + c(s, c) > c(a, b) + c(a, c)
c(s, a) + c(s, b) + c(s, c) > c(a, b) + c(b, c)
c(a, c) includes c(a, s), so c(a, c) > c(a, s)
c(s, b) + c(s, c) > c(a, b)
c(s, a) + c(s, c) > c(a, b)
c(s, b) - c(s, a) > 0
c(s, a) - c(s, b) > 0
⟶⟵

Sadly, although this heuristic won't mislead us, it may often fail to give an answer: more than two trees can tie for the least g-cost.

A more dynamic heuristic could rely on probabilistic estimates of the information gain at each query. Assuming a uniformly random permutation, initially we should just do the cheapest comparison. As we progress, we'll develop equivalence classes: these elements could be anywhere; those elements could only be in the top 6. Comparisons involving elements in large classes or classes near the min or max yield relatively little information. We could combine estimates of value with the given cost information to get reasonable average-case performance.

I decided not to work harder to prove a tight lower bound on the solution to the problem or to find a better algorithm or heuristic. It's TopCoder after all, and speed is important. I decided instead to move on to implementation and Google for a better answer later.

So in a nutshell my solution is: initially regard every element as a subsolution. Order them by g-cost and merge the two lowest-ranked solutions. Recurse.
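Here is a hedged Java sketch of that nutshell, with the oracle plumbing deliberately omitted (GCost is the hypothetical helper sketched above, and the Comparator stands in for the less-than oracle; none of this is my actual submission):

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Start with singleton runs and repeatedly merge the two runs of least g-cost.
final class GOrderedMergesort {
    static List<Integer> sort(final List<Integer> elements,
                              final GCost.CostFunction cost,
                              final Comparator<Integer> oracle) {
        List<List<Integer>> runs = new ArrayList<List<Integer>>();
        for (Integer e : elements) {
            runs.add(new ArrayList<Integer>(Collections.singletonList(e)));
        }
        while (runs.size() > 1) {
            // order the runs by g-cost, then merge the two cheapest
            Collections.sort(runs, new Comparator<List<Integer>>() {
                public int compare(List<Integer> r1, List<Integer> r2) {
                    long g1 = GCost.g(r1, complementOf(r1, elements), cost);
                    long g2 = GCost.g(r2, complementOf(r2, elements), cost);
                    return g1 < g2 ? -1 : g1 > g2 ? 1 : 0;
                }
            });
            List<Integer> a = runs.remove(0);
            List<Integer> b = runs.remove(0);
            runs.add(merge(a, b, oracle));
        }
        return runs.get(0);
    }

    private static List<Integer> complementOf(List<Integer> run, List<Integer> all) {
        List<Integer> rest = new ArrayList<Integer>(all);
        rest.removeAll(run);
        return rest;
    }

    private static List<Integer> merge(List<Integer> a, List<Integer> b, Comparator<Integer> lt) {
        // ordinary two-way merge, asking the oracle one comparison at a time
        List<Integer> out = new ArrayList<Integer>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            out.add(lt.compare(a.get(i), b.get(j)) <= 0 ? a.get(i++) : b.get(j++));
        }
        while (i < a.size()) out.add(a.get(i++));
        while (j < b.size()) out.add(b.get(j++));
        return out;
    }
}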

Continuation monad

The online interaction between the TopCoder oracle and my solution raises an interesting question. The first thing that occurred to me, of course, was to write g and merge, arrange to store state in instance variables, and update my instance state incrementally in the two callback functions whose existence is mandated by the oracle. On second thought, I wondered if my code could more closely resemble the nutshell description I gave above, with the state maintenance handled separately. It seemed like a good excuse to get some first-hand experience with continuation-passing style (CPS), a canonical form for programs that can be paused and restarted. The continuation monad brings CPS into a more general framework, improving modularity.
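Roughly, the shape the oracle imposes looks like this (the class and method names below are illustrative, not the exact TopCoder signatures):

// Illustrative shape only: one method receives the costs up front; the other
// is invoked once per answered comparison and must name the next comparison
// or report the finished order. Any state between calls lives in instance
// variables.
public class CostAwareSorter {
    private int[][] costs;

    // called once to initialize the program with the costs
    public void init(int[][] costMatrix) {
        this.costs = costMatrix;
    }

    // called repeatedly: lessThan answers the previously requested comparison;
    // the return value encodes either the next pair to compare or the final
    // ordering when no comparisons remain.
    public int[] nextQuery(boolean lessThan) {
        throw new UnsupportedOperationException("sketch only");
    }
}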

So I broke out my old favorites Monads for Functional Programming and Comprehending Monads. I also found a couple of new sources: Lecture notes: Continuations and call/cc and The Continuation Monad in Clojure. Continuations are also discussed at some length in the lambda papers, although the idea originated a decade earlier.

First, the monad preliminaries. When I implemented the maybe monad in C#, I had the ambition to support composition of different monads, and that led to some difficulty with the type system. I also noticed how cumbersome the syntax was, and what a damper that put on my enthusiasm for expressing the monad abstraction explicitly and separately. And that was with C# 3, which has overtaken Java in its evolution towards greater expressive power. So this time I decided to focus strictly on the continuation monad and the question of whether it could enable me to separate the oracle i/o from the specification of my sort algorithm.

Recall the monad laws:


@Test
// unit a ⋆ λb. n = n[a/b]
public void leftUnit() {
Func1<String, CMonad<Integer>> f =
new Func1<String, CMonad<Integer>>() {
public CMonad<Integer> f(String a) {
return CMonad.unit(a.length());
}
};

String a = "a";
CMonad<Integer> expected =
f.f(a);
CMonad<Integer> actual =
CMonad.map(
f,
CMonad.unit(a));

assertMEquals(expected, actual);
}

@Test
// m ⋆ λa. unit a = m
public void rightUnit() {
CMonad<Integer> m =
CMonad.unit(42);
Func1<Integer, CMonad<Integer>> id =
new Func1<Integer, CMonad<Integer>>() {
public CMonad<Integer> f(Integer x) {
return CMonad.unit(x);
}
};

assertMEquals(
m,
CMonad.map(id, m));
}

@Test
// m ⋆ (λa. n ⋆ λb. o) = (m ⋆ λa. n) ⋆ λb. o
public void associative() {
Func1<String, CMonad<Integer>> fa =
new Func1<String, CMonad<Integer>>() {
public CMonad<Integer> f(String x) {

return CMonad.unit(x.length());
}
};

Func1<Integer, CMonad<Integer>> fb =
new Func1<Integer, CMonad<Integer>>() {
public CMonad<Integer> f(Integer x) {
return CMonad.unit(x * 2);
}
};

final CMonad<String> m =
CMonad.unit("m");

final CMonad<Integer> n =
CMonad.unit(5);

final Func1<Integer, CMonad<Integer>> o =
new Func1<Integer, CMonad<Integer>>() {
public CMonad<Integer> f(Integer x) {
return CMonad.unit(x - 1);
}
};

CMonad<Integer> expected =
CMonad.map(
new Func1<String, CMonad<Integer>>() {
public CMonad<Integer> f(String x) {
return
CMonad.map(
o,
n);
}
},
m);

CMonad<Integer> actual =
CMonad.map(
o,
CMonad.map(
new Func1<String, CMonad<Integer>>() {
public CMonad<Integer> f(String x) {
return n;
}
},
m));

assertMEquals(expected, actual);
}

(assertMEquals is more or less what you'd expect):


private <T> void assertMEquals(CMonad<T> expected, CMonad<T> actual) {
Func1<T, T> id =
new Func1<T, T>() {
public T f(T x) {
return x;
}
};
T exp = expected.f(id);
T act = actual.f(id);
assertEquals(exp, act);
}

My continuation monad satisfies those:


public abstract static class CMonad<X> {
private CMonad() {}
public abstract <R> R f(Func1<X, R> k);
public static <X> CMonad<X> unit(final X x) {
return
new CMonad<X>() {
public <R> R f(Func1<X, R> k) {
return k.f(x);
}
};
}

// aka bind aka ★
public static <X, R> CMonad<R> map(final Func1<X, CMonad<R>> f, final CMonad<X> mx) {
return new CMonad<R>() {
public <S> S f(Func1<R, S> k) {
return mx.f(f).f(k);
}
};
}
}

call/cc seemed like a useful abstraction, so I implemented that in terms of my monad:


public Integer divideByZeroEg(final int x, final int y) {
CMonad<Integer> eg =
CMonad.callcc(
new Func1<Func1<Integer, CMonad<Integer>>, CMonad<Integer>>() {
public CMonad<Integer> f(Func1<Integer, CMonad<Integer>> esc) {
return
CMonad.map(
new Func1<Integer, CMonad<Integer>>() {
public CMonad<Integer> f(Integer z) {
return CMonad.unit(x / z);
}
},
(y == 0 ? esc.f(42) : CMonad.unit(y)));
}
});
return eg.f(
new Func1<Integer, Integer>() {
public Integer f(Integer x) {
return x;
}
});
}

@Test
public void divideByZeroEg() {
assertEquals((Object)5, divideByZeroEg(20, 4));
assertEquals((Object)42, divideByZeroEg(20, 0));
}

public static <X, Y> CMonad<X> callcc(final Func1<Func1<X, CMonad<Y>>, CMonad<X>> g) {
// \k -> g (\x -> \k' -> k x) k
// g :: ((X -> MY) -> MX)
// g :: ((X -> (Y -> R)) -> (X -> R'))
// k :: X -> R''
// R'' = R
// R = X -> R'
return
map(
new Func1<X, CMonad<X>>() {
public CMonad<X> f(X x) {
return unit(x);
}
},
new CMonad<X>() {
public <R> R f(final Func1<X, R> k) {
CMonad<X> gresult =
g.f(
new Func1<X, CMonad<Y>>() {
public CMonad<Y> f(final X x) {
return new CMonad<Y>() {
public <S> S f(Func1<Y, S> kprime) {
return (S)k.f(x);
}
};
}
});
return gresult.f(k);
}
});
}

and then I implemented a while loop atop continuation monad and call/cc:


@Test
public void finiteWhile() {
final int [] countdown = new int [] {3};
CMonad.cpswhile(
new Func0<Boolean>() {
public Boolean f() {
return countdown[0] > 0;
}
},
new SideEffectful() {
public void f() {
countdown[0] -= 1;
}
});
assertEquals(0, countdown[0]);
}

private static <T> CMonad<T> recurse(CMonad<T> m) {
final WhateverClosure<CMonad<T>> loop =
new WhateverClosure<CMonad<T>>();
loop.whatever =
CMonad.map(
new Func1<T, CMonad<T>>() {
public CMonad<T> f(T x) {
System.out.println("recurse");
return loop.whatever;
}
},
m);
return loop.whatever;
}

public static void cpswhile(final Func0<Boolean> loopInvariant, final SideEffectful body) {
callcc(
new Func1<Func1<Object, CMonad<Object>>, CMonad<Object>>() {
public CMonad<Object> f(final Func1<Object, CMonad<Object>> esc) {
return recurse(
new CMonad<Object>() {
public <R> R f(Func1<Object, R> f) {
System.out.println("start.f");
if (!loopInvariant.f()) {
System.out.println("done");
return
esc.f(null).f(f);
}
System.out.println("body.f");
body.f();
return f.f(null);
}
});
}
}).f(new Func1<Object, Object>() {
public Object f(Object x) {
return x;
}
});
}

public interface SideEffectful {
public void f();
}

Why reimplement the while? Java's while keyword only composes in the predefined ways that Java's language designers cared about. I wouldn't be able to effect oracle I/O between loop iterations, except by resorting to contrived method invocations in the loop body or test. Also, because the communications mechanism is method invocation, I couldn't actually use while without also using threads. By using composable continuation monads for control flow, I hoped to express my sort algorithm separately from managing the data flow between the program and the TC oracle.

Of course, Lisp and Haskell don't always translate easily to Java:


@Test
// (recipe for stack overflow)
public void infiniteWhile() {
// customize this value for your stack size/patience
int approximatelyInfinity = 1;

final Calendar finish = Calendar.getInstance();
finish.add(Calendar.SECOND, approximatelyInfinity);
//CMonad.cpsbounce(
CMonad.cpsbouncewhile(
new Func0<Boolean>() {
public Boolean f() {
return finish.after(Calendar.getInstance());
}
},
new SideEffectful() {
public void f() {
}
});
}

public static void cpsbouncewhile(final Func0<Boolean> loopInvariant, final SideEffectful body) {
CMonad<Object> mo =
callcc(
new Func1<Func1<Object, CMonad<Object>>, CMonad<Object>>() {
public CMonad<Object> f(final Func1<Object, CMonad<Object>> esc) {
final CMonad<Object> start =
new CMonad<Object>() {
public <R> R f(Func1<Object, R> f) {
System.out.println("start.f");
if (!loopInvariant.f()) {
System.out.println("done");
return
esc.f(null).f(f);
}
System.out.println("body.f");
body.f();
return f.f(null);
}
};

final WhateverClosure<CMonad<Object>> restartHolder = new WhateverClosure<CMonad<Object>>();
restartHolder.whatever =
map(
new Func1<Object, CMonad<Object>>() {
public CMonad<Object> f(Object x) {
System.out.println("restart.f");
return
esc.f(
Bounce.wrap(
restartHolder.whatever));
}
},
start);

return restartHolder.whatever;
}
});
for (; mo != null;
mo =
(CMonad<Object>)mo.f(new Func1<Object, Object>() {
public Object f(Object x) {
return x;
}
})) {
if (!(mo instanceof Bounce)) continue;
System.out.println("bouncey");
}
}

So then how'd I do? Well, the test code for while loops is fairly readable, but my goal was to write something like


// T is a priority queue of sorted runs
while (T.size() > 1) {
T.add(merge(comparator, T.remove(), T.remove()));
}

and have that automatically translated to a continuation monad representing the progress of the loop as a function from Booleans (comparison results) to pairs of its own type and pairs of elements to compare next.
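Concretely, the representation I had in mind looks something like the following (hypothetical types, not code from my solution):

// The paused sort is a function from the oracle's Boolean answer to its own
// successor plus either the next pair to compare or the finished order.
interface SortStep {
    Resumption resume(boolean lessThan);
}

final class Resumption {
    final SortStep next;  // how to continue once the next answer arrives
    final int[] query;    // the pair to compare next, or null when finished
    final int[] order;    // the completed permutation, meaningful when query == null

    Resumption(SortStep next, int[] query, int[] order) {
        this.next = next;
        this.query = query;
        this.order = order;
    }
}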

Dealing with the while per se was not bad, but translating the body requires quite a lot of fussiness. We need not only a restartable outer (while) loop, but also a restartable inner (merge) loop and we want these flattened so that the while's client code can be oblivious to the implementation choice to use functional composition in the while body. Looking at things a different way, we would like to encapsulate data flow so that a caller can repeatedly supply data to meet the demands of the callee until the callee finishes.

We could use call/cc to obtain a restart wherever we need input. These restarts are relatively easy to pass around (no need for joins), but they only support one standard operation (escape). Accordingly, each adjacent caller/callee pair has to be tweaked to expect/return restarts (analogously to monads in the above paragraph), and each caller has to branch at each invocation to either pass back a restart or move on with its own computation.

Alternatively, we could add a function to represent function application within the continuation monad framework. We could tweak while so that its body argument has a result type of continuation-monad, and then this new apply function could yield such a monad; while could be responsible for composing that result into its own result (using the join operation from MMA -> MA).
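For reference, join is definable directly in terms of the map above; a sketch (this method is not part of the CMonad class shown earlier):

// join :: M (M a) -> M a, written with the map (bind) defined above.
public static <X> CMonad<X> join(final CMonad<CMonad<X>> mmx) {
    return CMonad.map(
        new Func1<CMonad<X>, CMonad<X>>() {
            public CMonad<X> f(CMonad<X> mx) {
                return mx;
            }
        },
        mmx);
}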

Although we're able to dispense with the repetitive branching at function invocations in the call/cc approach by using monads, we can't escape the need to reimplement Java all along the way. If the body of a while loop includes its own function delegation, we'll have to make that monadic. If we need a conditional branch, we'll have to make that monadic. Even to implement merge sort we'd wind up with multiple nested anonymous class instance creation expressions. In between scads of braces, parens, brackets, and repetitions of type names like "Func1," we'd have the names of the functions we used to replace all the Java primitives. Lacking syntactic abstraction, the readability of the resulting DSL doesn't scale beyond the simple examples in this blog post.

Friday, December 4, 2009

TopCoding

Since Cross Country Coaches Moss and Brandon had us running sprints, I've realized the importance of short-distance running for long-distance training. So I suppose I've always suspected that TopCoder could be worthwhile, even for those of us who don't care about who can implement bubble sort fastest. Still, it lingered at the bottom of my priority queue. I don't strongly enjoy competition so it wouldn't be first-order fun for me, and I never had trouble thinking of more interesting intellectual challenges than speed-writing fizzbuzz.

But, Niniane told me I should use TopCoder, so I updated that priority. I'm not so haughty as to disregard Coach's advice.

Well, it turns out I was wrong about TopCoder. It's not only fizzbuzz: the problems with high point values do require some care with algorithm and datastructure selection. It's also not so much of a race as I'd thought. You get no points for an incorrect solution, and even very late answers (like with 12 hours of padding on a 75 minute problem) still get 30% credit.

I think coding time is just about the least important thing in programming. I'm well aware of the pitfalls of perfectionism and the whole nuanced worse is better concept. But I've very rarely been in a real-world programming situation in which the difference between 15 minutes and 1 hour made any practical difference, aside from labor cost. On the other hand, I've frequently seen that investing a little time in a thoughtful algorithm, test coverage, and an elegant implementation can pay off hugely. Naturally, there are many different assignments to the decision-theoretic parameters here; it's not surprising that TopCoder's scores are so simplistic (and well-suited to their business model).

I'm TopCoding in Java. Of course, I found myself Greenspunning almost immediately. The TopCoder rules explicitly allow code reuse, while penalizing dead code. Still, it feels like cheating to keep a warm Emacs buffer full of map/reduce/foldr/nth/setdefault around. A newbie will have to write and test that code, losing points against the veteran TopCoder for a reason that has nothing to do with skill.

Having attacked the TopCoder measure of programmer performance, I'm ready to praise it. First, the emphasis on speed accords with interview conditions. The high-pressure, short-duration interview is not ideal for sampling anyone's performance, but it's better than any of the alternatives I know. Second, just as in athletic cross-training, I think there's value in practicing against TopCoder's measure. It works against my natural perfectionism and, because it is so different from my daily routine, it helps me to strengthen areas that may be seldom used and hence relatively inefficient.

TopCoder's FAQs are spartan, and the UI is confusing for me. There are lots of sequential number ranges and acronyms, and after some fruitless web browsing/searching, I resorted to trial and error to infer their meanings. There aren't many context menus, and the key bindings don't follow my OS standards (⌘-C/⌘-X/⌘-V) or the emacs defaults. Try ctrl-C/ctrl-X/ctrl-V for copy/cut/paste. The code editor is terrible, but for interview prep I'm using it in preference to Emacs; I won't have a kill ring or isearch-forward available at the whiteboard, either.

Nobody is verbally critiquing you as you tap out your TopCoder solutions, so they provide a compiler and a very rudimentary test framework (you can feed your program arbitrary program inputs and see the result along with whatever it wrote to stdout and stderr). If interview prep is your goal, it's probably still a good idea to do some whiteboarding and paper exercises, but I think the TopCoder environment is impoverished enough that you'll have a heightened awareness of your dependence on compiler and test feedback.

TopCoder is organized as a shallow, fixed-depth hierarchy of chat rooms. The leaf nodes are each associated with three problems of varying difficulty and point value. Problem statements give the general form of input and output along with several example i/o pairs and a sentence or two about each. The lowest-valued problem statements give an algorithm in prose. The statements of higher-valued problems give fewer hints. Also, higher-valued problems tend to require more work (compared to a hypothetical lower-valued problem whose hints have been erased).

There are competition rooms and practice rooms; different rules governing your actions apply depending on the context. Specifically, in competitions, there are discrete, synchronized phases for coding, code review, and a final non-interactive test round; these activities are concurrent for practice rooms. The practice rooms are created in anticipation of competitions and disappear later. It's too bad that the number of rooms and hence the number and variety of available practice problems swings so wildly (from hundreds of problems to 3). There is an archive where you can read the problem statements and look at final stats, but solutions (including your own) and the test oracle are unavailable.

Unsurprisingly, TopCoder problems are less overtly mathematical than those of Project Euler. Whereas my main annoyance with Project Euler was tedium associated with thematic repetition (e.g., primality testing), my main annoyance with TopCoder is tedium associated with parsing the regular language of each problem's input. Project Euler emphasizes mathematical insight and your own preferred blend of elegance/efficiency/creativity/..., while TopCoder emphasizes implementation speed. To be clear, "annoyance" is far from a primary adjective I'd use for either site; I recommend them both as pastimes or as interview preparation.

The "summary" button in each chat room shows you the list of everyone's who's been there. For each person, you can see overall experience level, rank w/r/t the room, and status/earned points for each problem. By double-clicking on a score, you can read that person's solution and optionally challenge it by providing an input for which you think it fails; TopCoder will execute the solution against your input and either cancel the other person's points and reward you, or punish you for your false allegation by subtracting points from your score.

My informal survey of submitted solutions shows that most coders leave sloppy mistakes in their final submissions, reflecting the time pressure and lack of incentive for craftsmanship. There are almost no code comments, but the code usually speaks for itself. I usually remove test and debugging code. I don't revise for clarity or efficiency. It is striking to me that apparently I value conceptual focus more highly than most other TopCoders: my solutions tend to explicitly compose well-known algorithms using the modularity features of my chosen language. Being uninterested in competition, I don't plan to destructure my own code, but I wonder if this is a competitive advantage for them? I like to think that under real-world conditions, my approach yields better reusability and code clarity, but I don't really have empirical evidence for that.

If I have inspired you to try TopCoder, beware that opening any problem statement starts its timer, and closing the problem does not pause or stop the timer. I don't exactly regret having sampled problems from various rooms and point levels before starting, but I also don't care about how I rank relative to anyone else.

Sunday, October 11, 2009

Version numbers and (D)VCS

I've spent a lot of time this weekend trying to adapt apply-version.nant, originally written for oopsnet and svn, to ormar and mercurial. It wasn't so easy to find guidance from others with similar goals, so hopefully this post makes the information more accessible.

Definitions


The primary purpose of software version numbers is to identify the code. When we distribute to users, the version number gives them a concise reference for use in communicating about bugs and available features.

A secondary purpose is to define a temporal order on distributions. Versions are normally numeric and monotonically increasing; higher versions are newer and hopefully better than older versions.

Conventionally, there is at least some human involvement in assigning version numbers. In proprietary software, sales concerns are normally important. Often, the technical issue of compatibility is difficult to treat formally, so human judgment is used to encode compatibility information in version numbers.

But it's also conventional to let the low-order bits of a version number be determined automatically. Whenever build inputs change, the behavior of the software may change, but nobody wants to be bothered incrementing version numbers for routine small changes.

Besides version numbers, it's common to hear of "build numbers." Sometimes the terms are used interchangeably, but I think it's useful to distinguish between (a) an identifier for the build input and (b) an identifier for the build event. Some people use (b) as a more convenient proxy for (a), and some people apparently really care about (b) itself, although I'm not sure why. Maybe it's because on some teams deployment is part of the build process*, and it's nice to have a formal record of deployments.

Theory and practice


I've used Subversion for nearly my whole career so far. It's a centralized version control system, and sensibly enough it identifies historical events with natural numbers; revision 0 is repository creation. So, svn revision numbers are a very convenient basis for version numbers. Just check the working copy (wc) for consistency with a revision from the repository and use that revision number for the low-order version bits. Consistency between wc and repository for svn is a question of (a) uncommitted changes and (b) files or directories within the wc at different revisions.

This is a bit harder to do with some other centralized version control systems. In SCCS, revision numbers (SIDs) apply not to repositories but to individual files. Microsoft TFS has SVN-style revision numbers that they call "changeset numbers," but their implementation choices and tools make it difficult and expensive to answer the wc-repository consistency question. But fundamentally, in a cvcs, there's a global clock and that can serve as a basis for version numbering. In every cvcs I've seen, it's practical to use it that way although it might be easier in some cases (svn) and harder in others (tfs).

For distributed version control systems, we have no global clock. Fundamentally, events are temporally ordered only by causal relationships, so you can really only establish the primary property for version numbers: identifying build inputs. There's no general way to establish the secondary property that allows users to compare version numbers and decide which is newer. And yet, the mercurial project itself produces monotonic version numbers! How do they do it? Apparently by manually tagging.

How important is temporal ordering really? Certainly the most important thing is the capability for repeatable builds. Some DVCS projects have concise identifiers for build inputs; in hg and git we have hashcodes. Unfortunately for those of us on the .NET platform, Microsoft provides only 4 * 16 bits of space for version numbers in their assembly metadata specification. This isn't nearly enough for hg's 160-bit changeset ids (though it could accommodate the potentially ambiguous 48-bit short form), especially if we want to use one or more of those four fields for encoding compatibility data.

A common special case


There's a very common special case of projects using dvcs for which we can establish an objective order on versions. There's often an official repository, whose event history meets our need.

Well that's fine in theory, but is it practical? Unfortunately for me, hg doesn't allow remote queries of a given repository's local timestamps ("revision numbers"). I hope that's due to an efficiency trade-off and not just a pedantic effort ("these revision numbers aren't guaranteed to match those in your clone; use changeset ids instead!").

The good news is that in hg, revision number consistency is preserved under clone and pull operations. If you commit in your repository, you may irreconcilably lose consistency, but as long as you abstain from making changes, you and the other repo will agree on the bijective function between revision numbers and changeset ids. So my plan for .NET assembly versioning in my googlecode hg repositories is to use a pristine clone for official builds and at least one separate clone for feature development and bug fixes.


*For the IT web apps I've worked on, we had automated deployment as part of our CI routine, but we were satisfied to have an svn branch per deployment target. Actually, we had one svn branch per deployment target equivalence class representative. Really we had a small number of user communities (e.g., end-users, beta-testers, trainees, programmers), and we had a branch for each of them and the server clusters for each. (back)

Thursday, October 1, 2009

What I did on my summer vacation

I spent the second week of September 2009 in Yellowstone Country. This is my trip report.

Sean, Mike, Aubrey, Colin
My traveling companions were my parents, my brother, and his girlfriend. We met at BZN, where we rented a minivan for the drive to Yellowstone House. It was a rainy afternoon but between showers we caught glimpses of that signature Montana light that plays over the foothills and dramatically highlights the mountain peaks. We ate dinner at the Chico Dining Room, always a safe bet for fine dining. Yellowstone House has wifi and a PowerMac G4.

Barb et al.
For the next two days we went flyfishing with Grossenbacher Guides Bill, Bo, Brad Ehrnman, Brett Seng, and Brian Grossenbacher.

The thing to love or hate about flyfishing is that it emphasizes technique. The motion of the fly is determined not by its own weight (as it is in baitfishing) but by the weight distributed along your line; there is physical complexity here that corresponds to a high degree of choice in casting. Beyond that, there is a pretty large space of materials and configurations in the flies, and the trout discriminate taking into account the season, time of day, and past experience. I'm a novice, but more experienced flyfishermen explicitly use ecological knowledge in addition to recent observations of insect activity and fish feeding patterns. So, I can recommend flyfishing if you enjoy skill acquisition.

Someone once expressed surprise that I'd participate in an activity that involves cruelty to fish. Well, I've not noticed a strong trend either way among baitfishermen, but every flyfisherman I've encountered has expressed concern for the health and well-being of the fish. They remove barbs from hooks in order to minimize injury during catch-and-release fishing, they take care to handle fish gently without damaging gills, and they release carefully. Those practices satisfy my moral compass.

We were not the only ones enjoying the river those days; we also observed eagles, ospreys, otters, and mergansers fishing.

On Monday we ate at the Pine Creek Lodge, and it was good. I will say though that the cinnamon fajita tastes like it sounds.

On Tuesday we drove to the North entrance of Yellowstone National Park at Gardiner, MT. On the way through Mammoth Hot Springs to Old Faithful Inn we saw elk, mule deer, bison, and a black bear, but very sadly no moose. We stopped off to hike up to the Mount Washburn Fire Lookout, where we saw many chipmunks, a couple of marmots, and some bighorn sheep. Unfortunately Dad stopped about a half mile from the summit due to pain in his knees.

At Old Faithful Inn we had drinks and bar food at the bar and then dinner in the Dining Room. I was disappointed that the Crow's Nest is off-limits, but I did see Old Faithful erupt. Our rooms were in a renovated section (early 1990s), so we had private bathrooms.

On Wednesday we hiked half way to Artist Point along the Grand Canyon of the Yellowstone. Dad stayed back. The canyonlands supposedly harbor moose, but not surprisingly we didn't see them along the heavily trafficked trail. We drove to view a petrified tree and then we drove on to the Lamar Valley. There we photographed a buffalo at close range and did some fishing in the Lamar River, a tributary of the Yellowstone. At one point I misjudged the depth of the river and waded into chest-high water, drowning my phone.

We tried to eat at Helen's, home of the infamous Hateful Burger, but it was closed and for sale. Instead we ate at Zona Rosa in Livingston, where we had superb Latin American food at very reasonable prices. They were new at that time and did not accept credit cards or serve alcoholic beverages. Nearby landmarks include an enormous Taco John's sign and a car/truck wash.

Thursday we did guided fishing on the Madison River. The fish are generally larger and smarter than those of the Yellowstone. We caught very few whitefish, but a number of large brown trout. We ate at the Chico Saloon, which is definitely not served by the same kitchen as the dining room, while we observed some sporting event on TV.

Friday we hung around Yellowstone House, doing a little fishing from the bank of the Yellowstone River and reading and puzzle-solving. We had dinner again at Chico Hot Springs Dining Room, and then back to Yellowstone House for our last night in Montana.

We didn't see any grizzly bears this year, and I still have yet to see wild wolves or moose. Maybe next time.

Sunday, September 13, 2009

The trust relationship between the primary domain and the trusted domain failed.

Over the past month, I've spent a couple of hours puzzling over a bug in the implementation of Microsoft's System.Security.Principal.WindowsPrincipal.IsInRole. I want to say something about the general problem and then I'll talk specifically about this problem instance and how I worked around it.

The general form of the problem is something I often cited in code reviews at Anheuser-Busch. Now I believe that Microsoft owes its success much more to salesmanship than engineering ability, but this is a high-profile product with years of history in the field, so I'm surprised the bug has survived so long. I guess it's worth publishing my advice (which I'm sure did not originate with me):


  1. Throw or propagate exceptions at the right level of abstraction. The Java language designers were on to something with checked exceptions: exceptions really can be regarded as part of your API (see also criticism linked at c2). If you have a method String Profile.GetSetting(String) with two implementations, one of which uses ADO.NET for relational database storage while the other uses System.IO.File for filesystem storage, then callers should never see System.Data.Common.DbExceptions. If, as the maintainer of Profile.GetSetting, you anticipate ADO.NET or System.IO exceptions, then you should catch them and throw exceptions of some more appropriate type(s). It's OK and usually a good idea to use what you catch to initialize the InnerException property of what you throw: that's useful diagnostic information. But you should make the effort to wrap the exceptions you propagate when doing so mitigates a leaky abstraction. (A minimal sketch follows this list.)

  2. Use distinct exception types for distinct equivalence classes of conditions you think callers might reasonably treat in different ways. I think it's helpful to think about Lisp conditions, which generalize exceptions, when you make design decisions involving errors.

  3. Document the exceptions you throw.
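Here's a minimal sketch of point 1 in Java (the post's example types are .NET; ProfileException and the JDBC-backed implementation below are hypothetical): catch the storage-layer exception, wrap it at the Profile abstraction level, and keep the original as the cause.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical abstraction-level exception type.
class ProfileException extends Exception {
    ProfileException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical JDBC-backed implementation: callers see the Profile
// abstraction, never the storage details.
class JdbcProfile {
    private final Connection connection;

    JdbcProfile(Connection connection) {
        this.connection = connection;
    }

    String getSetting(String name) throws ProfileException {
        try {
            PreparedStatement s =
                connection.prepareStatement("select value from settings where name = ?");
            s.setString(1, name);
            ResultSet rs = s.executeQuery();
            return rs.next() ? rs.getString(1) : null;
        }
        catch (SQLException e) {
            // wrap at the right level of abstraction; keep the cause for diagnostics
            throw new ProfileException("Could not read setting " + name, e);
        }
    }
}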



And without further ado, the specifics:

Bug 4541

Summary: The trust relationship between the primary domain and the trusted domain failed.
Product: REDACTED Reporter: Foy, Sean
Component: triage        Assignee: Foy, Sean
Status: ASSIGNED

Severity: blocker CC: michael.jorgensen@REDACTED
Priority: P3

Version: unspecified

Hardware: All

OS: All

URL: http://tahaze-ww07.REDACTED/REDACTED/requests/19791117T120000?requestor=sean.foy@REDACTED
Whiteboard:
Time tracking:
Orig. Est.  Actual Hours  Hours Worked  Hours Left  %Complete  Gain
8.0         1.1           1.0           0.1         90         6.9
Deadline:



Description Foy, Sean 2009-09-13 19:08:54 CDT
Steps to reproduce:
0. logout if you're not using Windows Authentication.
1. create a request
2. follow the link to the request from the assignment page
3. authenticate (implicitly or otherwise) using Windows Authentication
Expected results: request edit page
Actual results: authorization (IsInRole predicate evaluation on a
WindowsPrincipal) fails.


System.SystemException: The trust relationship between the primary domain and the trusted domain failed.

at
System.Security.Principal.NTAccount.TranslateToSids(IdentityReferenceCollection
sourceAccounts, Boolean& someFailed)
at System.Security.Principal.NTAccount.Translate(IdentityReferenceCollection
sourceAccounts, Type targetType, Boolean& someFailed)
at System.Security.Principal.NTAccount.Translate(IdentityReferenceCollection
sourceAccounts, Type targetType, Boolean forceSuccess)
at System.Security.Principal.WindowsPrincipal.IsInRole(String role)
at MVCUtils.IsInRole(String role) in
c:\documents-and-settings\sean.foy\my-documents\REDACTED\MvcUtils.cs:line 218
at REDACTED.REDACTED.IsInRole(String rolename) in
c:\documents-and-settings\sean.foy\my-documents\REDACTED\REDACTED.cs:line 284
at REDACTED.Viewee03e5938d284b4d944019fc1e9c8130.RenderViewLevel0() in
c:\documents-and-settings\sean.foy\my-documents\REDACTED\Views\Request\requestform.spark:line
20
at REDACTED.Viewee03e5938d284b4d944019fc1e9c8130.RenderView(TextWriter writer) in
c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\Temporary ASP.NET
Files\REDACTED\930b9661\fd62837e49004099a1fddf1e5288ae22-1.cs:line 1240
at Spark.Web.Mvc.SparkView.Render(ViewContext viewContext, TextWriter
writer)
at System.Web.Mvc.ViewResultBase.ExecuteResult(ControllerContext context)
at
System.Web.Mvc.ControllerActionInvoker.InvokeActionResult(ControllerContext
controllerContext, ActionResult actionResult)
at
System.Web.Mvc.ControllerActionInvoker.<>c__DisplayClass11.<InvokeActionResultWithFilters>b__e()
at
System.Web.Mvc.ControllerActionInvoker.InvokeActionResultFilter(IResultFilter
filter, ResultExecutingContext preContext, Func`1 continuation)
at
System.Web.Mvc.ControllerActionInvoker.<>c__DisplayClass11.<>c__DisplayClass13.<InvokeActionResultWithFilters>b__10()
at
System.Web.Mvc.ControllerActionInvoker.InvokeActionResultWithFilters(ControllerContext
controllerContext, IList`1 filters, ActionResult actionResult)
at System.Web.Mvc.ControllerActionInvoker.InvokeAction(ControllerContext
controllerContext, String actionName)
at System.Web.Mvc.Controller.ExecuteCore()
at System.Web.Mvc.ControllerBase.Execute(RequestContext requestContext)
at
System.Web.Mvc.ControllerBase.System.Web.Mvc.IController.Execute(RequestContext
requestContext)
at System.Web.Mvc.MvcHandler.ProcessRequest(HttpContextBase httpContext)
at System.Web.Mvc.MvcHandler.ProcessRequest(HttpContext httpContext)
at
System.Web.Mvc.MvcHandler.System.Web.IHttpHandler.ProcessRequest(HttpContext
httpContext)
at
System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean&
completedSynchronously)

REDACTED, Version=1.0.4835.11, Culture=neutral, PublicKeyToken=null version
1.0.4835.11

Comment 1 Foy, Sean 2009-09-13 20:08:54 CDT

Additional hours worked: 1.0
There are no documented exceptions for WindowsPrincipal.IsInRole and although Google finds many reports of this exception, the most promising resolutions are vaguely described patches from Microsoft for SharePoint. But empirically I've found that I get reasonable answers for queries of the form
p.IsInRole(@"domain\group") and
p.IsInRole(@"group@domain")
where the domain name can be any non-empty string. But I get this exception for queries of the form
p.IsInRole(@"group").
Evidently IsInRole expects a domain name and throws this exception when none is present.


And here's the patch:

Index: MvcUtils.cs
===================================================================
--- MvcUtils.cs (revision 4851)
+++ MvcUtils.cs (working copy)
@@ -1,6 +1,7 @@
using System;
using System.Collections.Generic;
using System.Linq;
+using System.Security.Principal;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;
@@ -214,8 +215,22 @@
return result.ToString();
}

+ public static Boolean IsInRole(IPrincipal p, String role) {
+ try {
+ return p.Identity.IsAuthenticated && p.IsInRole(role);
+ }
+ catch (SystemException e) {
+ // see bug 4541
+ if (!e.Message.StartsWith("The trust relationship between the primary domain and the trusted domain failed.")) {
+ throw;
+ }
+
+ return false;
+ }
+ }
+
public static Boolean IsInRole(String role) {
- return HttpContext.Current.User.Identity.IsAuthenticated && HttpContext.Current.User.IsInRole(role);
+ return IsInRole(HttpContext.Current.User, role);
}
}

Index: TestMvcUtils.cs
===================================================================
--- TestMvcUtils.cs (revision 4851)
+++ TestMvcUtils.cs (working copy)
@@ -175,6 +175,32 @@
r => authorized.IsInRole(r)));
}

+ private System.Security.Principal.WindowsPrincipal getWindowsPrincipal() {
+ return
+ new System.Security.Principal.WindowsPrincipal(
+ System.Security.Principal.WindowsIdentity.GetCurrent());
+
+ }
+
+ [Test]
+ public void WindowsPrincipalIsInRoleExpectsDomain() {
+ var p = getWindowsPrincipal();
+ try {
+ p.IsInRole("whatever");
+ Assert.Fail("maybe we can simplify MvcUtils.IsInRole after all");
+ }
+ catch (SystemException) {
+ //expected
+ }
+ }
+
+ [Test]
+ public void IsInRoleToleratesAbsenceOfDomains() {
+ var p = getWindowsPrincipal();
+ MVCUtils.IsInRole(p, @"domain\whatever");
+ MVCUtils.IsInRole(p, "whatever@domain");
+ }
+
public Mock<HttpContextBase> mockContext(String applicationRoot, String requestUrl) {
var appRoot = new Uri(applicationRoot);
var ctx = new Mock<HttpContextBase>();
Index: Web.config
===================================================================
--- Web.config (revision 4851)
+++ Web.config (working copy)
@@ -12,8 +12,8 @@
<add key="ActiveDirectoryForestName" value="REDACTED"></add>
<add key="ActiveDirectoryUserName" value="REDACTED\svc.tahaze.websql"></add>
<add key="ActiveDirectoryPassword" value="REDACTED"></add>
- <add key="map-to-role-msl" value="REDACTED_Form_Editors"></add>
- <add key="map-to-role-admin" value="REDACTED_Admins"></add>
+ <add key="map-to-role-msl" value="REDACTED\REDACTED_Form_Editors"></add>
+ <add key="map-to-role-admin" value="REDACTED\REDACTED_Admins"></add>
<add key="MVCAuthnz.CookieAuthenticationHttpModule.signingKey" value="REDACTED"></add>
<add key="seanfoy.oopsnet.assemblyQualifiedName" value="seanfoy.oopsnet.ASPNetErrorHandler,seanfoy.oopsnet"></add>
<add key="seanfoy.oopsnet.handlerName" value="handleError"></add>
@@ -84,7 +84,6 @@


-->
-
<system.web>
<httpModules>
<add name="UrlRoutingModule" type="System.Web.Routing.UrlRoutingModule, System.Web.Routing"/>


Sunday, August 2, 2009

Paging for Mutable Data

Many web applications provide segmented result sets in response to queries. This approach can conserve server-side effort when users decide to reformulate a query or use partial results rather than actually attending to all the results. Even if a user will eventually demand the complete set of results, paging can smooth the demand/time graph, lowering the peaks.

It's easy to find library support for paging, but it's rare to find good support for paging mutable data. Typically the library provides a set of results given an integer offset and page size. Another way to characterize this functionality is to say that the library does integer arithmetic. I don't object to procedural abstraction, but it seems to me that this is not only simple but simplistic.

We can always enumerate our datasets, but I think that users don't normally care about specific offsets. I for one normally care about the whole set of results that satisfy my query. Or at least I think I do, until I see from the results that my query spec wasn't quite right. Anyway, I think we users normally formulate queries in terms of the problem domain, not in terms of sequences of integers. I don't normally care about the eleventh or the twenty-sixth result because of its ordinal rank; I just want the next element of my dataset.

So where's the problem? We know that we can set up a one-to-one correspondence between integers and countable data sets, so why not use convenient integers to specify pages? As long as the data are immutable, or users are tolerant of certain inaccuracies, we can use integers to identify pages.

But, in general the data might be changing even as users peruse the results. Suppose I specify a query q, whose predicate is satisfied by the following ordered result elements at time 0: <A, C, E, G, I, J>. At time 1, F begins to satisfy q so the result set becomes <A, C, E, F, G, I, J>. If my page size is 2 and I was looking at the second page at time 0, what should I see when I proceed to the third page at time 2? Do I see G and I because results_t2[(page - 1) * size] = results_t2[4] identifies G? Would the apparent duplicate result in that case be confusing or just annoying? Should I see I and J because I started viewing at t_0 and therefore my page request should be interpreted as results_t0[4] = I?

Suppose an item I've already seen is deleted as I page through results. Will I see the item that would have been first on the next page at all?

If we're going to avoid anomalous disappearances and false duplicates, we can't rely on naively recomputing the function from integers to results on-demand. Perhaps we can store history and use a (page-number, timestamp) pair to index our pages.

If the cost of storing history is too great, there are compromises available aside from allowing both false apparent duplicates and omissions of elements that have existed throughout. We can allow repetition in order to avoid erroneous omission, or we can allow omission to avoid false repetition. Finally, if we disallow duplicates, we can use observable properties of the elements themselves as page indices. Often, (apparent) duplicates are undesirable anyway:

Summaries are often preferred over raw data:

A (raw)      B (summary)
pepperoni    pepperoni: 4
mushroom     mushroom: 2
pepperoni    cheese: 1
mushroom
pepperoni
cheese
pepperoni


In my experience, the most common scenario involves concurrent usage, mutable underlying data, and two laws of user expectation: (1) if an item satisfied the query predicate continuously from before the query was issued to after the user reached the last page of results, the user should see the item and (2) one unchanging item should appear at most once in the pages.

Where these requirements are satisfiable, we'd like to encapsulate and reuse the implementation details. The interface should work in domain terms: given a data element, total order relation, and window size as input, it could provide the next n elements.
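A minimal, language-neutral sketch of that interface shape (in Java here, although the implementation later in this post is C# over NHibernate; the names are illustrative): pages are addressed by the last element seen, not by an integer offset.

import java.util.Comparator;
import java.util.List;

// Return up to windowSize elements that follow 'after' under the given total
// order; pass null for 'after' to get the first page.
interface PagedQuery<T> {
    List<T> nextPage(T after, Comparator<T> order, int windowSize);
}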

The main idea here is pretty simple, both in conception and in implementation. Part of the motivation for this blog post though was to discuss the details of client interface design and testing.

Here's my first test design for the core functionality:


[Test]
public void paging() {
const int count =
(int)(
// first page: stable + deleted
(pageSize + 1.0) * 3.0 / (1 + 1 + 0) +
// second page: stable + deleted + ins
(pageSize + 1.0) * 3.0 / (1 + 1 + 0) +
// third page: stable + ins
(pageSize + 1.0) * 3.0 / (1 + 1 + 0));
var items = Enumerable.Range(0, count).ToList();
var stable =
Enumerable.Where(
items,
x => x % 3 == 0).ToList();
var inserting =
Enumerable.Where(
items,
x => x % 3 == 1).ToList();
var deleting =
Enumerable.Where(
items,
x => x % 3 == 2).ToList();
var attendance =
Enumerable.ToDictionary(
items,
x => x,
x => 0);
try {
foreach (var i in stable) {
persist(i);
}
foreach (var i in deleting) {
persist(i);
}
int timer = 0;
foreach (var p in getPages()) {
foreach (var i in p) {
++attendance[i];
}
if (timer == 1) {
foreach (var ins in inserting) {
persist(ins);
}
}
else if (timer == 2) {
foreach (var del in deleting) {
delete(del);
}
}
++timer;
}
Assert.IsTrue(timer > 3, "There were too few pages in this test to see both kinds of concurrency anomaly.");
Assert.IsTrue(
Enumerable.All(
stable,
x => attendance[x] == 1),
"Every item that is present throughout the process should be seen exactly once.");
Assert.IsTrue(
Enumerable.Any(
deleting,
x => attendance[x] == 0),
"Some item deleted during this test should not have been seen. This may indicate eager retrieval of pages, which is isn't feasible in general (for large datasets).");
Assert.IsFalse(
Enumerable.Any(
items,
x => attendance[x] > 1),
"No item should be seen twice.");
// it's OK if items inserted during the test are seen
}
finally {
foreach (var i in items) {
delete(i);
}
}
}


For me, the best thing about this code is its strong focus on two key axioms. I partition all possible inputs in such a way as to demonstrate those, and then I arrange to draw samples from those partitions in what I hope is pretty unobtrusive way. I check that I've covered the necessary cases and then I show that the axioms are true for my sample data. getPages simulates sequential uses of the paging abstraction by a client:


protected override IEnumerable<IEnumerable<int>> getPages() {
const int pageSize = 10;
LocalItem memento = null;
IEnumerable<LocalItem> p;
do {
var slicedCriteria =
Potpourri.Page<LocalItem>.page(
DetachedCriteria.For(typeof(LocalItem)),
new [] {"itemNumber"},
memento);
p = LocalItem.SlicedFindAll(0, pageSize + 1, slicedCriteria);
memento =
Enumerable.LastOrDefault(p);
yield return
Potpourri.slice(
Enumerable.Select(
p,
i => elementToInt(i)),
0,
pageSize);
} while (Enumerable.Count(p) > pageSize);
}


Details: in real life this code runs against shared relational database instances, but for testing purposes I execute against SQLite for isolation. This test occurs in an abstract TestFixture with two concrete subclasses, one of whose getPages method is shown above.

The other uses the raw NHibernate windowing primitives in the naive way; that test class is marked with the IgnoreAttribute because that approach is doomed. I actually started by writing that test case as an executable demonstration of the motivation for the less obvious technique.

Here's my second test design. I think on balance it's actually less readable than the first attempt. I wonder how many people would agree?


[Test]
public void insertionSansDuplicates() {
insertion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(DuplicateDetectedException))]
public void insertionAnomalies() {
insertion(fetchPageByOffset);
}

[Test]
public void deletionSansSkips() {
deletion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(AnomalyDetectedException))]
public void deletionAnomalies() {
deletion(fetchPageByOffset);
}


Again, I wrote working test code for both straightforward, broken use of primitives and for the approach that suits my needs.

Here's how I address the two kinds of trouble that can result from trying to use offsets for paging in the face of concurrent modification:


private class AnomalyDetectedException : Exception {}
private class DuplicateDetectedException : AnomalyDetectedException {}
private void insertion(FetchPage fetchPage) {
var q = DetachedCriteria.For<Beer>();
var O = new [] {"ABV", "BrewDate", "Variety"};
Paging.AddOrder(q, O);
var seenit = new HashSet<Beer>();
using (var factory = Persistence.makeSessionFactory()) {
using (var s = factory.OpenSession()) {
var page =
fetchPage(
s,
q,
O,
null,
pageSize + 1);
foreach (Beer i in Enumerable.TakeWhile(page, (elt, i) => i < pageSize)) {
if (seenit.Contains(i)) throw new DuplicateDetectedException();
seenit.Add(i);
}
var first = (Beer)page[0];
var nouveau =
new Beer {
Variety = "not" + first.Variety,
BrewDate = first.BrewDate,
ABV = first.ABV / 2};
s.Save(nouveau);
s.Flush();
page =
fetchPage(s, q, O, Enumerable.Last(page), pageSize);
foreach (Beer i in page) {
if (seenit.Contains(i)) throw new DuplicateDetectedException();
seenit.Add(i);
}
}
}
}

private void deletion(FetchPage fetchPage) {
var q = DetachedCriteria.For<Beer>();
var O = new [] {"ABV", "BrewDate", "Variety"};
Paging.AddOrder(q, O);
var expected = new HashSet<Object>();
var actual = new HashSet<Object>();
using (var factory = Persistence.makeSessionFactory()) {
using (var s = factory.OpenSession()) {
var page = fetchPage(s, q, O, null, pageSize);
foreach (var i in page) expected.Add(i);
foreach (var i in page) actual.Add(i);
foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
expected.Add(i);
}
s.Delete(page[0]);
s.Flush();
foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
actual.Add(i);
}

actual.SymmetricExceptWith(expected);
if (0 != actual.Count) throw new AnomalyDetectedException();
}
}
}


The test data are much less carefully prepared:


[TestFixtureSetUp]
public void setup() {
Persistence.rebuildDB();
using (var factory = Persistence.makeSessionFactory()) {
using (var s = factory.OpenSession()) {
Assert.That(s.Linq<Bottle>().Count(), Is.GreaterThan(pageSize));
Assert.That(s.Linq<Beer>().Count(), Is.GreaterThan(pageSize));
}
}
}


That rebuildDB method isn't particularly geared towards the paging problem. It's really intended for use with a broader class of NHibernate-related addons.

I also have boundary test cases:


[Test]
public void empty() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle));
            Assert.IsFalse(
                Enumerable.Any(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max + 1 },
                        pageSize)));
        }
    }
}

[Test]
public void singleton() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle)).
                    AddOrder(Order.Asc("Serial"));
            Assert.AreEqual(
                1,
                Enumerable.Count(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max },
                        pageSize)));
        }
    }
}


When there's no concurrent data modification, the simpler, direct use of the NHibernate primitives and the more complicated use of my paging abstraction produce equivalent results:


[Test]
public void naturals() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Bottle>().AddOrder(Order.Asc("Serial"))).ToArray();
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Bottle>(),
                Paging.Page<Bottle>(s, Q[1], new [] {"Serial"}, null, pageSize));
        }
    }
}

[Test]
public void beers() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var O = new [] {"ABV", "BrewDate", "Variety"};
            // ICriterion instances are mutable and don't
            // expose a copy constructor or clone method,
            // so we must make two criteria, just alike.
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Beer>()).ToArray();
            foreach (var q in Q) {
                Paging.AddOrder(q, O);
            }
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Beer>(),
                Paging.Page(s, Q[1], O, null, pageSize));
        }
    }
}


I welcome your comments and criticism. Please see ormar for details or to publish your own code review.

vogon fodder

In this post, I show-and-tell about automated paper-pushing.

I spent considerable effort at work last week building something that is minimally useful. I am preoccupied by the conflict between the demands of professionalism and the idiotic/cynical mockery of professionalism that is usual (though not universal) in large organizations. But I've got to eat and I should focus on problems I can realistically solve. For now, let's accept the demands of management without question and focus on how to satisfy the demands at minimum cost to human life. In the words of Steve Jobs:

Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and that's 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?

Your manager emails you an MSWord document and asks you to fill it out once per software deployment. It should list every affected "object", indicate when (if ever) the "User" has been notified and when and by whom (if ever) "BAM" approval was given, and reference the other paperwork associated with the project. What do you do?

Clearly, the right answer involves CI and LaTeX. In my case: svn, ccnet, Bugzilla, python, xslt, ruby, LaTeX, perl, and pdf.

We've been using CI with SVN to do deployments since 2004. There's a branch for each deployment target and a CI project associated with each of those branches. Each CI build usually corresponds to a single commit, but we don't insist on that; when commits come fast enough, we go for speed over precision about who's "on the hook" for breaking changes. So the first problem is to find the union of affected "objects" over the set of svn commits in the deployment.

The term "object" is taken from the argot of the RPGers, who are the oldest and largest group of software people at our company. For them, "object" often refers to a printed tabular summary of data but more generally can refer to a program for any purpose. The usual custom outside the RPG tribe is to list as "objects" ui elements, but sometimes instead OO classes or source files are listed. It's not worth being too careful about this, because the real goal is pro forma compliance with a vague demand for documentation. Users basically want release notes, and engineers already get precise, accurate, and efficient answers to technical questions from svn and Bugzilla. The form really just needs to exist and, at a glance, to appear normal.

Our CI system CruiseControl.NET has a feature called modification writer that renders metadata about a build to XML. This provides a list of added, deleted, and textually-modified files for each commit that's new since the last build:

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfModification xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Modification>
<Type>Modified</Type>
<FileName>ccnet-self.config</FileName>
<FolderName>/ccnet</FolderName>
<ModifiedTime>2009-07-24T10:23:08.162158-05:00</ModifiedTime>
<UserName>sean.foy</UserName>
<ChangeNumber>4413</ChangeNumber>
<Version />
<Comment>whitespace
[Bug 50,51,3783]</Comment>
<IssueUrl>https://bugzilla.thcg.net/buglist.cgi?bug_id=50,51,3783</IssueUrl>
</Modification>
</ArrayOfModification>
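
To make "the union of affected objects" concrete: given the modification writer's output, the computation is just a set union over FolderName/FileName pairs. Here's a minimal sketch, illustrative only (the worksheet itself is assembled by the XSLT below), assuming the XML above is saved as modifications.xml; the affected_objects name is invented for this example:

# Sketch only: collapse a ccnet modifications document into the distinct
# set of affected "objects" (file paths).
import sys
import xml.etree.ElementTree as ET

def affected_objects(path):
    root = ET.parse(path).getroot()
    objects = set()
    for m in root.findall('Modification'):
        folder = m.findtext('FolderName', default='')
        name = m.findtext('FileName', default='')
        objects.add(folder.rstrip('/') + '/' + name)
    return sorted(objects)

if __name__ == '__main__':
    for obj in affected_objects(sys.argv[1] if len(sys.argv) > 1 else 'modifications.xml'):
        print(obj)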

We integrated SVN and Bugzilla in 2004, and so since then we've been able to correlate commits with bugs. This gives us informal commentary about the motivations and design considerations for each commit, plus metadata including who was involved, how much time each of us spent, whether we paired or got code review, whether DBAs got involved, who approved deployment, etc. Well, at least in principle. In practice, only one or two projects have ever really taken advantage of these capabilities. But the point is that this metadata fits the purported goal of our paperwork, it's useful and easily accessible, and it exists for at least one project. It's the best we can do, and it's pretty good. So we've got to use it.

Python is a fairly pleasant language to work in, and nobody's terrified by its syntax. It also has great libraries for screen-scraping. Accordingly, we used a python program to augment ccnet's modifications xml: it reads the ccnet xml, finds the relevant bugs via the bugtraq convention, and writes their Bugzilla metadata back into the document:

<ArrayOfModification>
<Modification>
<Type>Modified</Type>
<FileName>ccnet-self.config</FileName>
<FolderName>/ccnet</FolderName>
<ModifiedTime>2009-07-24T10:23:08.162158-05:00</ModifiedTime>
<UserName>sean.foy</UserName>
<ChangeNumber>4413</ChangeNumber>
<Version />
<Comment>whitespace
[Bug 50,51,3783]</Comment>
<IssueUrl>https://bugzilla.thcg.net/buglist.cgi?bug_id=50,51,3783</IssueUrl>
<database />
<flags>
<flag id="1882" name="QCApproved" setter="michael.jorgensen@covidien.com" status="+" />
</flags>
</Modification>
</ArrayOfModification>
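
The real bzaugment.py screen-scrapes our Bugzilla installation, so I won't reproduce it here, but the shape of the transformation is roughly the sketch below: parse the bugtraq-style "[Bug ...]" markers out of each Comment, look up each bug (stubbed here; lookup and its return value are placeholders, not our scraping code), and graft database and flags elements onto the Modification. The command-line handling mirrors the NAnt invocation near the end of this post (bzaugment.py modifications.xml -o modifications.augmented.xml):

# Sketch only: the Bugzilla lookup is stubbed out.
import re
import sys
import xml.etree.ElementTree as ET

BUGTRAQ = re.compile(r'\[Bug\s+([0-9,\s]+)\]')

def bug_ids(comment):
    ids = []
    for m in BUGTRAQ.finditer(comment or ''):
        ids.extend(i.strip() for i in m.group(1).split(',') if i.strip())
    return ids

def lookup(bug_id):
    # Placeholder for the screen-scraping: the real script fetches the bug
    # from Bugzilla and extracts its flags (e.g. QCApproved) and whether a
    # database change is involved.
    return {'flags': [], 'database': False}

def augment(src, dst):
    tree = ET.parse(src)
    for mod in tree.getroot().findall('Modification'):
        infos = [lookup(b) for b in bug_ids(mod.findtext('Comment'))]
        if any(info['database'] for info in infos):
            ET.SubElement(mod, 'database')
        flags = ET.SubElement(mod, 'flags')
        for info in infos:
            for attrs in info['flags']:
                ET.SubElement(flags, 'flag', attrs)
    tree.write(dst)

if __name__ == '__main__':
    augment(sys.argv[1], sys.argv[sys.argv.index('-o') + 1])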

Now we have all the answers we need to fill out our worksheet, in XML. We just need to produce MSWord output (good luck). Well, MSWord probably isn't really necessary. That's what's expected, but we should keep in mind that the goal is pro forma compliance: this document's real use is to ward off auditors, so it just needs to exist and not look unusual. So, we want to typeset XML? Obviously we should consider DocBook, XSL-FO, and LaTeX. I looked into XSL-FO c. 2001 and DocBook c. 2002, but I know LaTeX best. I wanted to spend less than 4 hours on this little project, so I settled on LaTeX.

For processing XML, you could do worse than XSLT. It can be particularly nice as a declarative template specification: "make my input look like this". So, I used xslt to process the bz-augmented ccnet modifications xml into LaTeX:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:smf="http://seanfoy.thcg.net/xslt">
<xsl:output method="text" />

<xsl:param name="asr">application service request number</xsl:param>
<xsl:param name="builddate">the date these changes became effective</xsl:param>
<xsl:param name="trunkdeploymenttarget">unc path</xsl:param>
<xsl:param name="trunkdbserver">hostname</xsl:param>
<xsl:param name="trunkdbname">dbname</xsl:param>
<xsl:param name="productiondeploymenttarget">unc path</xsl:param>
<xsl:param name="productiondbserver">hostname</xsl:param>
<xsl:param name="productiondbname">dbname</xsl:param>

<xsl:function name="smf:latexlit">
<xsl:param name="raw" />
<xsl:value-of select="fn:replace(fn:replace($raw, '([$&amp;%#_{}~^\\])', '\\$1'), '\\\\', '\$\\backslash\$')" />
</xsl:function>

<xsl:template match="@*|node()">
</xsl:template>
<xsl:template match="/">
<xsl:text>
\documentclass{article}

\usepackage{hyperref}
\usepackage{longtable}
\usepackage{latexsym}

\usepackage[document]{ragged2e}
\usepackage{wordlike}
\renewcommand{\familydefault}{phv} % Arial
\sloppy
\makeatletter
\renewcommand{\subsection}{\@startsection%
{subsection}%
{1}
{0em}
{-\baselineskip}%
{0.5\baselineskip}%
{\bfseries\LARGE\sffamily}}
\makeatother

\author{Sean Foy}
\title{Sample}

\begin{document}
\section*{Internet/Intranet Production Move Request}
\textbf{ASR /Track-it WO\#:} </xsl:text><xsl:value-of select="smf:latexlit($asr)" /><xsl:text>\hfill\textbf{Move Date:}

\textbf{Requestor/ User Notification Sent:} $\Box$ \textbf{Dated:}

\subsection*{Object(s): $\Box$Web \qquad $\Box$WebMethods (WM)}
\begin{longtable}{|l|l|}\hline
</xsl:text>
<xsl:for-each select="ArrayOfModification/Modification">
<xsl:value-of select="smf:latexlit(FileName)" />
<xsl:text>&amp; -r</xsl:text><xsl:value-of select="smf:latexlit(ChangeNumber)" />
<xsl:if test="fn:not(fn:empty(IssueUrl))">
<xsl:text> (</xsl:text><xsl:value-of select="smf:latexlit(IssueUrl)" /><xsl:text>)</xsl:text>
</xsl:if>
<xsl:text> \\</xsl:text>
</xsl:for-each>
<xsl:text>\hline
\end{longtable}

\subsection*{DB Object(s):}
\begin{longtable}{|l|l|}\hline
</xsl:text>
<xsl:for-each select="ArrayOfModification/Modification[database]">
<xsl:value-of select="smf:latexlit(FileName)" />
<xsl:text>&amp; -r</xsl:text><xsl:value-of select="smf:latexlit(ChangeNumber)" />
<xsl:if test="fn:not(fn:empty(IssueUrl))">
<xsl:text> (</xsl:text><xsl:value-of select="smf:latexlit(IssueUrl)" /><xsl:text>)</xsl:text>
</xsl:if>
<xsl:text> \\</xsl:text>
</xsl:for-each>
<xsl:text>
\hline
\end{longtable}

\subsection*{Deployment Servers:}
\begin{tabular}{|l|l|l|}\hline
&amp;
Development&amp;
Production\\\hline
Web:&amp;
\textbf{UNC:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdeploymenttarget)" /><xsl:text>&amp;
\textbf{UNC:} </xsl:text><xsl:value-of select="smf:latexlit($productiondeploymenttarget)" /><xsl:text>\\\hline
DB:&amp;
\textbf{Server:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdbserver)" /><xsl:text> \textbf{Database:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdbname)" /><xsl:text>&amp;
\textbf{Server:} </xsl:text><xsl:value-of select="smf:latexlit($productiondbserver)" /><xsl:text> \textbf{Database:} </xsl:text><xsl:value-of select="smf:latexlit($productiondbname)" /><xsl:text>\\\hline
WM:&amp;
\textbf{UNC:} &amp;
\textbf{UNC:} \\\hline
\end{tabular}

\rule{\textwidth}{0.25em}

\textbf{Move Approved by BAM}:

Signature: \underline{\makebox[2in][l]{}}

Date: \underline{\makebox[1in][l]{}}

\end{document}
</xsl:text>
</xsl:template>
</xsl:stylesheet>

LaTeX is based on Don Knuth's TeX typesetting system. It's serious business: Turing-complete and designed specifically for the needs of the world's most demanding typographical experts.

For a while I thought about merging the bug and svn data here in LaTeX. My TeX-hacking skills are very weak now, so I had in mind the python LaTeX package, which embeds Python code in the document. But as I found myself escaping, for XSL, the LaTeX escape of the Python escape of the LaTeX output I wanted, I decided I was probably making a mistake. LaTeX is great for typesetting, and capable of general computation, but it's awkward to express the heavy lifting there (even in an embedded language). The python package is probably better used tactically, for simple comprehensions and the like -- not for elaborating significantly on the structure of the LaTeX markup.

We must compile the LaTeX to obtain our output (normally PostScript or PDF). I use a perl script, latexmk, to manage the multi-pass compilation and to clean up the intermediate products.

So we want to combine ((((modifications xml and Bugzilla) by python to xml) and xsl) by xslt to LaTeX) by latexmk to PDF. We need to process our inputs and intermediate products in topological order; this is a build process. I've used make before, but it's a bit crufty. I prefer NAnt, whose XML syntax is more explicit and discoverable. From what I've seen of rake, I like it better than NAnt: NAnt's a priori boundary between its implementation and your build specification is confining, while rake draws no distinction between itself and your build specification.

# -*- mode: ruby -*-
# Render modifications as beautiful
# (or hideous!) PDF
#
# sample invocation:
# rake latexmk=vendor/latexmk/latexmk.pl modifications.pdf

require 'rake/clean'
require 'rexml/document'
require 'rexml/xpath'

transform = "#{ENV['transform'] || 'vendor/saxonb9-1-0-7n/bin/Transform.exe'}"
latexmk = "#{ENV['latexmk'] || 'vendor/latexmk/latexmk.pl'}"

modifications_xml = "#{ENV['modifications_xml'] || '../Artifacts/modifications.xml'}"

task :default => 'modifications.pdf'

task :clean => ['clobberhelper'] do
end

task :clobber => ['clobberhelper'] do
end

def contains_clobber(ts)
  ts.each { |i|
    return true if (i == 'clobber') || contains_clobber(Rake::Task[i].prerequisites)
  }
  return false
end

task :clobberhelper do |t|
  if contains_clobber(Rake::application().top_level())
    sh latexmk, '-C'
  else
    sh latexmk, '-c'
  end
end

file 'modifications.tex' => [modifications_xml, 'modifications.xsl'] do |t|
  xsl_params = {}
  begin
    f = File.new('modifications.xsl')
    doc = REXML::Document.new(f)
    REXML::XPath.each(doc.root, 'xsl:param', {'xsl' => 'http://www.w3.org/1999/XSL/Transform'}) do |elt|
      xsl_params[elt.attributes['name']] = ENV[elt.attributes['name']] || elt.text
    end
  ensure
    f.close unless f.nil?
  end
  xsl_args = xsl_params.inject([]){ |accum, (key, value)|
    accum << "#{key}=#{value}"
  }
  sh transform, "-s:#{modifications_xml}", '-xsl:modifications.xsl', '-o:modifications.tex', *xsl_args
end
CLEAN.include('modifications.tex')

file 'modifications.pdf' => ['modifications.tex'] do |t|
  sh latexmk, '-pdf', 'modifications.tex'
end

If you don't know about rake, one point of confusion is that rake clean conventionally means "remove intermediates" while rake clobber means "remove intermediate and final outputs".

With my use of latexmk, I ran into a wrinkle in the standard clean and clobber rake tasks: latexmk needs the .tex file to decide what to clean up, but rake wants to clean (removing the .tex) before it clobbers. I needed latexmk to do its cleanup before clean runs, and latexmk needs to know whether to remove the PDF too or just the intermediates. My solution was to add a prerequisite, clobberhelper, to the clean and clobber tasks: it calls latexmk's cleanup routine, searching the tasks rake was asked to run (and their prerequisites) for clobber in order to decide whether to invoke latexmk with -C (remove everything) or -c (remove only intermediates).

Finally, we have a corporate commitment to .NET and so most of our projects benefit from NAnt's understanding of .NET. We want an easy way to use this motley assortment of tools with our existing NAnt build files:

<?xml version="1.0" encoding="utf-8"?>
<project name="sdim" default="move-worksheet">
<property name="artifacts_dir" value="." overwrite="false" />
<!-- nant for Windows doesn't use the cygwin path so
you need to chdir to %cygwin_home%\bin and then
run %cygwin_home%\path\to\python or
install Windows python and do e.g.,
C:\Python31\python.exe -->
<property name="python" value="C:\Python31\python.exe" overwrite="false" />
<property name="rake" value="C:\Ruby\bin\rake.bat" overwrite="false" />
<property name="modifications_xml" value="modifications.xml" overwrite="false" />

<target name="move-worksheet">
<exec program="${python}">
<arg line="bzaugment.py ${path::combine(artifacts_dir, modifications_xml)} -o ${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
</exec>
<exec program="${rake}">
<arg line="modifications_xml=${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
</exec>
</target>

<target name="clean">
<exec program="${rake}">
<arg line="clobber" />
</exec>
<delete file="${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
</target>
</project>

I still have some reservations about the morality of this. On the one hand, no human should waste even 30 minutes of precious life preparing ambiguous, incomplete, inaccessible and uninsightful summaries of data that's already available, accurate, complete, and easily-accessible. On the other hand, you can lead a manager to water but you can't make him drink. This solution seems to balance those two forces, but by reducing the cost of useless redundancy, am I encouraging further bureaucratic malignancy?

Voilà:


Anyway, I hope you enjoyed the tour of this motley assemblage of languages and tools.


Thanks to Josh Buedel for comments and suggestions during the preparation of this post.