Sunday, August 2, 2009

Paging for Mutable Data

Many web applications provide segmented result sets in response to queries. This approach can conserve server-side effort when users reformulate a query or make do with partial results rather than attending to every result. Even when a user will eventually demand the complete result set, paging can smooth the demand/time graph, lowering the peaks.

It's easy to find library support for paging, but it's rare to find good support for paging mutable data. Typically the library provides a set of results given an integer offset and page size. Another way to characterize this functionality is to say that the library does integer arithmetic. I don't object to procedural abstraction, but it seems to me that this is not only simple but simplistic.
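
In sketch form, that contract is roughly the following. GetPage is a name I'm inventing for illustration, not any particular library's API:

using System.Collections.Generic;
using System.Linq;

static class OffsetPaging {
    // The typical library contract: given an integer offset and a page size,
    // do the arithmetic and return that slice of the results.
    public static IList<T> GetPage<T>(IQueryable<T> source, int pageNumber, int pageSize) {
        return source
            .Skip(pageNumber * pageSize)
            .Take(pageSize)
            .ToList();
    }
}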

We can always enumerate our datasets, but I think that users don't normally care about specific offsets. I for one normally care about the whole set of results that satisfy my query. Or at least I think I do, until I see from the results that my query spec wasn't quite right. Anyway, I think we users normally formulate queries in terms of the problem domain, not in terms of sequences of integers. I don't normally care about the eleventh or the twenty-sixth result because of its ordinal rank; I just want the next element of my dataset.

So where's the problem? We know that we can set up a one-to-one correspondence between integers and countable data sets, so why not use convenient integers to specify pages? As long as the data are immutable, or users are tolerant of certain inaccuracies, we can use integers to identify pages.

But in general, the data might be changing even as users peruse the results. Suppose I specify a query q, whose predicate is satisfied by the following ordered result elements at time 0: <A, C, E, G, I, J>. At time 1, F begins to satisfy q, so the result set becomes <A, C, E, F, G, I, J>. If my page size is 2 and I was looking at the second page at time 0, what should I see when I proceed to the third page at time 2? Do I see G and I, because results_t2[(page - 1) * size] = results_t2[4] identifies G? Would the apparent duplicate result in that case be confusing or just annoying? Should I see I and J, because I started viewing at t_0 and therefore my page request should be interpreted as results_t0[4] = I?
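
In executable form, the arithmetic looks like this (a throwaway NUnit-style sketch with in-memory lists, separate from the fixtures discussed below):

[Test]
public void offsetPagingRepeatsAfterInsertion() {
    var resultsT0 = new[] { "A", "C", "E", "G", "I", "J" };
    var resultsT2 = new[] { "A", "C", "E", "F", "G", "I", "J" }; // F satisfies q from time 1 on
    const int size = 2;
    const int page = 3; // the page I ask for next
    // Recomputing from the integer offset against the current data repeats G,
    // which I already saw on page 2 at time 0...
    CollectionAssert.AreEqual(
        new[] { "G", "I" },
        resultsT2.Skip((page - 1) * size).Take(size).ToList());
    // ...while interpreting the request against the time-0 results gives I and J.
    CollectionAssert.AreEqual(
        new[] { "I", "J" },
        resultsT0.Skip((page - 1) * size).Take(size).ToList());
}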

Suppose an item I've already seen is deleted as I page through results. Will I see the item that would have been first on the next page at all?

If we're going to avoid anomalous disappearances and false duplicates, we can't rely on naively recomputing the function from integers to results on demand. Perhaps we can store history and use a (page-number, timestamp) pair to index our pages.
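
A minimal sketch of that idea, assuming we can afford to snapshot each query's full result list (QueryHistory is a made-up name, not a library type):

using System.Collections.Generic;
using System.Linq;

// Snapshot the results when the query is issued; serve every later page
// request for that timestamp from the immutable snapshot.
public class QueryHistory<T> {
    private readonly IDictionary<long, IList<T>> snapshots =
        new Dictionary<long, IList<T>>();

    public void Record(long timestamp, IEnumerable<T> results) {
        snapshots[timestamp] = results.ToList();
    }

    public IEnumerable<T> GetPage(long timestamp, int pageNumber, int pageSize) {
        // Pages are stable because they index data frozen at query time.
        return snapshots[timestamp]
            .Skip(pageNumber * pageSize)
            .Take(pageSize);
    }
}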

If the cost of storing history is too great, there are compromises available aside from allowing both false apparent duplicates and omissions of elements that have existed throughout. We can allow repetition in order to avoid erroneous omission, or we can allow omission to avoid false repetition. Finally, if we disallow duplicates, we can use observable properties of the elements themselves as page indices. Often, (apparent) duplicates are undesirable anyway:

Summaries are often preferred over raw data

A           B
pepperoni   pepperoni: 4
mushroom    mushroom: 2
pepperoni   cheese: 1
mushroom
pepperoni
cheese
pepperoni


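The summary column is nothing more than a grouping of the raw column; in LINQ terms, something like this (data made up to match the table above):

var toppings = new[] {
    "pepperoni", "mushroom", "pepperoni", "mushroom",
    "pepperoni", "cheese", "pepperoni" };
var summary =
    toppings
        .GroupBy(t => t)
        .Select(g => string.Format("{0}: {1}", g.Key, g.Count()));
// summary: "pepperoni: 4", "mushroom: 2", "cheese: 1"
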
In my experience, the most common scenario involves concurrent usage, mutable underlying data, and two laws of user expectation: (1) if an item satisfied the query predicate continuously from before the query was issued to after the user reached the last page of results, the user should see the item and (2) one unchanging item should appear at most once in the pages.

Where these requirements are satisfiable, we'd like to encapsulate and reuse the implementation details. The interface should work in domain terms: given a data element, a total order relation, and a window size, it provides the next n elements.
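
Here's a rough, in-memory sketch of that interface. The names (KeysetPager, NextPage) are mine for illustration only; the real implementation exercised by the tests below expresses the same idea against NHibernate criteria:

using System;
using System.Collections.Generic;
using System.Linq;

// Pages are addressed by the last element seen, not by an integer offset.
public class KeysetPager<T, TKey>
    where T : class
    where TKey : IComparable<TKey> {

    private readonly Func<IEnumerable<T>> query; // re-evaluated on every call, so it sees live data
    private readonly Func<T, TKey> key;          // defines the total order; assumed unique per element

    public KeysetPager(Func<IEnumerable<T>> query, Func<T, TKey> key) {
        this.query = query;
        this.key = key;
    }

    // Given the last element already seen (null for the first page) and a
    // window size, return the next n elements in key order.
    public IList<T> NextPage(T lastSeen, int n) {
        IEnumerable<T> remaining = query().OrderBy(key);
        if (lastSeen != null) {
            remaining = remaining.Where(x => key(x).CompareTo(key(lastSeen)) > 0);
        }
        return remaining.Take(n).ToList();
    }
}

Against a real database the ordering and the greater-than-the-last-key filter belong in the query itself rather than in memory, which is the job of the criteria-based calls in the rest of this post; a composite key (like the ABV, BrewDate, Variety triple used below) serves when no single property is unique.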

The main idea here is pretty simple, both in conception and in implementation. Part of the motivation for this blog post, though, was to discuss the details of client interface design and testing.

Here's my first test design for the core functionality:


[Test]
public void paging() {
    const int count =
        (int)(
            // first page: stable + deleted
            (pageSize + 1.0) * 3.0 / (1 + 1 + 0) +
            // second page: stable + deleted + ins
            (pageSize + 1.0) * 3.0 / (1 + 1 + 1) +
            // third page: stable + ins
            (pageSize + 1.0) * 3.0 / (1 + 0 + 1));
    var items = Enumerable.Range(0, count).ToList();
    var stable =
        Enumerable.Where(
            items,
            x => x % 3 == 0).ToList();
    var inserting =
        Enumerable.Where(
            items,
            x => x % 3 == 1).ToList();
    var deleting =
        Enumerable.Where(
            items,
            x => x % 3 == 2).ToList();
    var attendance =
        Enumerable.ToDictionary(
            items,
            x => x,
            x => 0);
    try {
        foreach (var i in stable) {
            persist(i);
        }
        foreach (var i in deleting) {
            persist(i);
        }
        int timer = 0;
        foreach (var p in getPages()) {
            foreach (var i in p) {
                ++attendance[i];
            }
            if (timer == 1) {
                foreach (var ins in inserting) {
                    persist(ins);
                }
            }
            else if (timer == 2) {
                foreach (var del in deleting) {
                    delete(del);
                }
            }
            ++timer;
        }
        Assert.IsTrue(timer > 3, "There were too few pages in this test to see both kinds of concurrency anomaly.");
        Assert.IsTrue(
            Enumerable.All(
                stable,
                x => attendance[x] == 1),
            "Every item that is present throughout the process should be seen exactly once.");
        Assert.IsTrue(
            Enumerable.Any(
                deleting,
                x => attendance[x] == 0),
            "Some item deleted during this test should not have been seen. This may indicate eager retrieval of pages, which isn't feasible in general (for large datasets).");
        Assert.IsFalse(
            Enumerable.Any(
                items,
                x => attendance[x] > 1),
            "No item should be seen twice.");
        // it's OK if items inserted during the test are seen
    }
    finally {
        foreach (var i in items) {
            delete(i);
        }
    }
}


For me, the best thing about this code is its strong focus on the two key axioms. I partition all possible inputs in such a way as to demonstrate them, and then I arrange to draw samples from those partitions in what I hope is a pretty unobtrusive way. I check that I've covered the necessary cases, and then I show that the axioms hold for my sample data. getPages simulates sequential uses of the paging abstraction by a client:


protected override IEnumerable<IEnumerable<int>> getPages() {
    const int pageSize = 10;
    LocalItem memento = null;
    IEnumerable<LocalItem> p;
    do {
        var slicedCriteria =
            Potpourri.Page<LocalItem>.page(
                DetachedCriteria.For(typeof(LocalItem)),
                new [] {"itemNumber"},
                memento);
        // fetch one extra element so we can tell whether another page follows
        p = LocalItem.SlicedFindAll(0, pageSize + 1, slicedCriteria);
        memento =
            Enumerable.LastOrDefault(p);
        yield return
            Potpourri.slice(
                Enumerable.Select(
                    p,
                    i => elementToInt(i)),
                0,
                pageSize);
    } while (Enumerable.Count(p) > pageSize);
}


Details: in real life this code runs against shared relational database instances, but for testing purposes I execute against SQLite for isolation. The test lives in an abstract TestFixture with two concrete subclasses; the getPages method of one of them is shown above.

The other uses the raw NHibernate windowing primitives in the naive way; that test class is marked with the IgnoreAttribute because that approach is doomed. I actually started by writing that test case as an executable demonstration of the motivation for the less obvious technique.
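
For orientation, the fixture layout is roughly this. It's only a sketch: just the members referenced above are shown, and the subclass names and attributes are invented for this post:

// Abstract fixture: owns the paging() test plus the persistence helpers;
// subclasses differ only in how they walk the pages.
public abstract class PagingFixture {
    protected const int pageSize = 10; // assumed to match the value used in getPages
    protected abstract IEnumerable<IEnumerable<int>> getPages();
    protected void persist(int item) { /* save a LocalItem carrying this itemNumber */ }
    protected void delete(int item) { /* remove the corresponding LocalItem */ }
    // ... the [Test] paging() method shown earlier lives here ...
}

[TestFixture]
public class MementoPaging : PagingFixture {
    // getPages as shown above, via Potpourri.Page and a memento element
}

[TestFixture, Ignore("offset-based paging suffers concurrency anomalies")]
public class OffsetPaging : PagingFixture {
    // getPages via the raw NHibernate windowing primitives and integer offsets
}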

Here's my second test design. I think on balance it's actually less readable than the first attempt. I wonder how many people would agree?


[Test]
public void insertionSansDuplicates() {
    insertion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(DuplicateDetectedException))]
public void insertionAnomalies() {
    insertion(fetchPageByOffset);
}

[Test]
public void deletionSansSkips() {
    deletion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(AnomalyDetectedException))]
public void deletionAnomalies() {
    deletion(fetchPageByOffset);
}


Again, I wrote working test code both for the straightforward, broken use of primitives and for the approach that suits my needs.

Here's how I address the two kinds of trouble that can result from trying to use offsets for paging in the face of concurrent modification:


private class AnomalyDetectedException : Exception {}
private class DuplicateDetectedException : AnomalyDetectedException {}

private void insertion(FetchPage fetchPage) {
    var q = DetachedCriteria.For<Beer>();
    var O = new [] {"ABV", "BrewDate", "Variety"};
    Paging.AddOrder(q, O);
    var seenit = new HashSet<Beer>();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var page =
                fetchPage(
                    s,
                    q,
                    O,
                    null,
                    pageSize + 1);
            foreach (Beer i in Enumerable.TakeWhile(page, (elt, i) => i < pageSize)) {
                if (seenit.Contains(i)) throw new DuplicateDetectedException();
                seenit.Add(i);
            }
            var first = (Beer)page[0];
            // insert a row that should sort ahead of everything already seen
            var nouveau =
                new Beer {
                    Variety = "not" + first.Variety,
                    BrewDate = first.BrewDate,
                    ABV = first.ABV / 2};
            s.Save(nouveau);
            s.Flush();
            page =
                fetchPage(s, q, O, Enumerable.Last(page), pageSize);
            foreach (Beer i in page) {
                if (seenit.Contains(i)) throw new DuplicateDetectedException();
                seenit.Add(i);
            }
        }
    }
}

private void deletion(FetchPage fetchPage) {
    var q = DetachedCriteria.For<Beer>();
    var O = new [] {"ABV", "BrewDate", "Variety"};
    Paging.AddOrder(q, O);
    var expected = new HashSet<Object>();
    var actual = new HashSet<Object>();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var page = fetchPage(s, q, O, null, pageSize);
            foreach (var i in page) expected.Add(i);
            foreach (var i in page) actual.Add(i);
            // the second page as it stands before any deletion
            foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
                expected.Add(i);
            }
            s.Delete(page[0]);
            s.Flush();
            // the second page as actually fetched after deleting a first-page row
            foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
                actual.Add(i);
            }

            actual.SymmetricExceptWith(expected);
            if (0 != actual.Count) throw new AnomalyDetectedException();
        }
    }
}


The test data are much less carefully prepared:


[TestFixtureSetUp]
public void setup() {
    Persistence.rebuildDB();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            Assert.That(s.Linq<Bottle>().Count(), Is.GreaterThan(pageSize));
            Assert.That(s.Linq<Beer>().Count(), Is.GreaterThan(pageSize));
        }
    }
}


That rebuildDB method isn't particularly geared towards the paging problem. It's really intended for use with a broader class of NHibernate-related addons.

I also have boundary test cases:


[Test]
public void empty() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle));
            Assert.IsFalse(
                Enumerable.Any(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max + 1 },
                        pageSize)));
        }
    }
}

[Test]
public void singleton() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle)).
                    AddOrder(Order.Asc("Serial"));
            Assert.AreEqual(
                1,
                Enumerable.Count(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max },
                        pageSize)));
        }
    }
}


When there's no concurrent data modification, the simpler, direct use of NHibernate primitives and the more complicated use of my paging abstraction produce equivalent results:


[Test]
public void naturals() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Bottle>().AddOrder(Order.Asc("Serial"))).ToArray();
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Bottle>(),
                Paging.Page<Bottle>(s, Q[1], new [] {"Serial"}, null, pageSize));
        }
    }
}

[Test]
public void beers() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var O = new [] {"ABV", "BrewDate", "Variety"};
            // ICriterion instances are mutable and don't
            // expose a copy constructor or clone method,
            // so we must make two criteria, just alike.
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Beer>()).ToArray();
            foreach (var q in Q) {
                Paging.AddOrder(q, O);
            }
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Beer>(),
                Paging.Page(s, Q[1], O, null, pageSize));
        }
    }
}


I welcome your comments and criticism. Please see ormar for details or to publish your own code review.
