Sunday, August 2, 2009

Paging for Mutable Data

Many web applications provide segmented result sets in response to queries. This approach can conserve server-side effort when users decide to reformulate a query or use partial results rather than actually attending to all the results. Even if a user will eventually demand the complete set of results, paging can smooth the demand/time graph, lowering the peaks.

It's easy to find library support for paging, but it's rare to find good support for paging mutable data. Typically the library provides a set of results given an integer offset and page size. Another way to characterize this functionality is to say that the library does integer arithmetic. I don't object to procedural abstraction, but it seems to me that this is not only simple but simplistic.
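
Concretely, the usual contract boils down to something like this (a generic sketch in LINQ terms, not any particular library's signature):

// page n is whatever happens to occupy positions
// [n * pageSize, n * pageSize + pageSize) at the moment of the call
static IEnumerable<T> getPage<T>(IEnumerable<T> results, int pageNumber, int pageSize) {
    return
        Enumerable.Take(
            Enumerable.Skip(results, pageNumber * pageSize),
            pageSize);
}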

We can always enumerate our datasets, but I think that users don't normally care about specific offsets. I for one normally care about the whole set of results that satisfy my query. Or at least I think I do, until I see from the results that my query spec wasn't quite right. Anyway, I think we users normally formulate queries in terms of the problem domain, not in terms of sequences of integers. I don't normally care about the eleventh or the twenty-sixth result because of its ordinal rank; I just want the next element of my dataset.

So where's the problem? We know that we can set up a one-to-one correspondence between integers and countable data sets, so why not use convenient integers to specify pages? As long as the data are immutable, or users are tolerant of certain inaccuracies, we can use integers to identify pages.

But in general, the data might be changing even as users peruse the results. Suppose I specify a query q whose predicate is satisfied by the following ordered result elements at time 0: <A, C, E, G, I, J>. At time 1, F begins to satisfy q, so the result set becomes <A, C, E, F, G, I, J>. If my page size is 2 and I was looking at the second page at time 0, what should I see when I proceed to the third page at time 2? Do I see G and I, because results_t2[(page - 1) * size] = results_t2[4] identifies G? Would the apparent duplicate result in that case be confusing or just annoying? Should I see I and J, because I started viewing at t_0 and therefore my page request should be interpreted as results_t0[4] = I?

Suppose an item I've already seen is deleted as I page through results. Will I see the item that would have been first on the next page at all?
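
To make both anomalies concrete, here's an in-memory sketch of naive offset arithmetic, with LINQ standing in for the persistence layer:

var before = new List<string> { "A", "C", "E", "G", "I", "J" };
const int size = 2;
// pages at time 0
var page1 = before.Skip(0 * size).Take(size).ToList(); // A, C
var page2 = before.Skip(1 * size).Take(size).ToList(); // E, G

// insertion: F begins to satisfy q before we fetch page 3
var afterInsert = new List<string> { "A", "C", "E", "F", "G", "I", "J" };
var page3 = afterInsert.Skip(2 * size).Take(size).ToList(); // G, I -- G appears twice

// deletion: A (already seen on page 1) is deleted before we fetch page 2
var afterDelete = new List<string> { "C", "E", "G", "I", "J" };
var page2b = afterDelete.Skip(1 * size).Take(size).ToList(); // G, I -- E, present
                                                             // throughout, is never shown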

If we're going to avoid anomalous disappearances and false duplicates, we can't rely on naively recomputing the function from integers to results on demand. Perhaps we can store history and use a (page-number, timestamp) pair to index our pages.
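
A minimal sketch of that idea, assuming we can afford to snapshot the ordered results when a query is issued (the names here are mine, not a real library's):

class HistoryPager<T> {
    private readonly Dictionary<DateTime, List<T>> history =
        new Dictionary<DateTime, List<T>>();

    // freeze the ordered results at issue time
    public DateTime issue(IEnumerable<T> orderedResults) {
        var issuedAt = DateTime.UtcNow;
        history[issuedAt] = Enumerable.ToList(orderedResults);
        return issuedAt;
    }

    // (pageNumber, issuedAt) now identifies a page deterministically
    public IEnumerable<T> page(DateTime issuedAt, int pageNumber, int pageSize) {
        return
            Enumerable.Take(
                Enumerable.Skip(history[issuedAt], pageNumber * pageSize),
                pageSize);
    }
}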

If the cost of storing history is too great, there are compromises available aside from allowing both false apparent duplicates and omissions of elements that have existed throughout. We can allow repetition in order to avoid erroneous omission, or we can allow omission to avoid false repetition. Finally, if we disallow duplicates, we can use observable properties of the elements themselves as page indices. Often, (apparent) duplicates are undesirable anyway:

Summaries are often preferred over raw data:

    A (raw)      B (summary)
    pepperoni    pepperoni: 4
    mushroom     mushroom: 2
    pepperoni    cheese: 1
    mushroom
    pepperoni
    cheese
    pepperoni


In my experience, the most common scenario involves concurrent usage, mutable underlying data, and two laws of user expectation: (1) if an item satisfied the query predicate continuously from before the query was issued to after the user reached the last page of results, the user should see the item and (2) one unchanging item should appear at most once in the pages.

Where these requirements are satisfiable, we'd like to encapsulate and reuse the implementation details. The interface should work in domain terms: given a data element, a total order relation, and a window size n, it provides the next n elements.
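
In NHibernate terms, "the next n elements after m" is the standard lexicographic seek: strictly greater on the first order key, or tied on it and strictly greater on the second, and so on. A sketch of the construction (my names; not necessarily how Potpourri builds it):

// for order keys (k1, ..., kn) and last-seen element m, "after m" means
//   k1 > m.k1 OR (k1 = m.k1 AND k2 > m.k2) OR ...
//     OR (k1 = m.k1 AND ... AND k(n-1) = m.k(n-1) AND kn > m.kn)
static ICriterion after(string[] keys, object[] mementoValues) {
    ICriterion result = null;
    for (var i = 0; i < keys.Length; ++i) {
        // tied on every more-significant key, strictly greater on this one
        ICriterion disjunct = Restrictions.Gt(keys[i], mementoValues[i]);
        for (var j = 0; j < i; ++j) {
            disjunct =
                Restrictions.And(
                    Restrictions.Eq(keys[j], mementoValues[j]),
                    disjunct);
        }
        result = result == null ? disjunct : Restrictions.Or(result, disjunct);
    }
    return result;
}

Adding such a criterion to the ordered query and fetching n + 1 rows yields the page plus the memento for the next request. Note that the order must be total -- the key tuple has to be unique -- or "after m" is ambiguous.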

The main idea here is pretty simple, both in conception and in implementation. Part of the motivation for this blog post, though, was to discuss the details of client interface design and testing.

Here's my first test design for the core functionality:


[Test]
public void paging() {
    const int count =
        (int)(
            // first page: stable + deleted
            (pageSize + 1.0) * 3.0 / (1 + 1 + 0) +
            // second page: stable + deleted + ins
            (pageSize + 1.0) * 3.0 / (1 + 1 + 1) +
            // third page: stable + ins
            (pageSize + 1.0) * 3.0 / (1 + 0 + 1));
    var items = Enumerable.Range(0, count).ToList();
    var stable =
        Enumerable.Where(
            items,
            x => x % 3 == 0).ToList();
    var inserting =
        Enumerable.Where(
            items,
            x => x % 3 == 1).ToList();
    var deleting =
        Enumerable.Where(
            items,
            x => x % 3 == 2).ToList();
    var attendance =
        Enumerable.ToDictionary(
            items,
            x => x,
            x => 0);
    try {
        foreach (var i in stable) {
            persist(i);
        }
        foreach (var i in deleting) {
            persist(i);
        }
        int timer = 0;
        foreach (var p in getPages()) {
            foreach (var i in p) {
                ++attendance[i];
            }
            if (timer == 1) {
                foreach (var ins in inserting) {
                    persist(ins);
                }
            }
            else if (timer == 2) {
                foreach (var del in deleting) {
                    delete(del);
                }
            }
            ++timer;
        }
        Assert.IsTrue(
            timer > 3,
            "There were too few pages in this test to see both kinds of concurrency anomaly.");
        Assert.IsTrue(
            Enumerable.All(
                stable,
                x => attendance[x] == 1),
            "Every item that is present throughout the process should be seen exactly once.");
        Assert.IsTrue(
            Enumerable.Any(
                deleting,
                x => attendance[x] == 0),
            "Some item deleted during this test should not have been seen. This may indicate eager retrieval of pages, which isn't feasible in general (for large datasets).");
        Assert.IsFalse(
            Enumerable.Any(
                items,
                x => attendance[x] > 1),
            "No item should be seen twice.");
        // it's OK if items inserted during the test are seen
    }
    finally {
        foreach (var i in items) {
            delete(i);
        }
    }
}


For me, the best thing about this code is its strong focus on two key axioms. I partition all possible inputs in such a way as to demonstrate those axioms, and then I arrange to draw samples from those partitions in what I hope is a pretty unobtrusive way. I check that I've covered the necessary cases and then I show that the axioms are true for my sample data. getPages simulates sequential uses of the paging abstraction by a client:


protected override IEnumerable<IEnumerable<int>> getPages() {
    const int pageSize = 10;
    LocalItem memento = null;
    IEnumerable<LocalItem> p;
    do {
        var slicedCriteria =
            Potpourri.Page<LocalItem>.page(
                DetachedCriteria.For(typeof(LocalItem)),
                new [] {"itemNumber"},
                memento);
        p = LocalItem.SlicedFindAll(0, pageSize + 1, slicedCriteria);
        memento =
            Enumerable.LastOrDefault(p);
        yield return
            Potpourri.slice(
                Enumerable.Select(
                    p,
                    i => elementToInt(i)),
                0,
                pageSize);
    } while (Enumerable.Count(p) > pageSize);
}


Details: in real life this code runs against shared relational database instances, but for testing purposes I execute against SQLite for isolation. This test occurs in an abstract TestFixture with two concrete subclasses, one of whose getPages method is shown above.

The other uses the raw NHibernate windowing primitives in the naive way; that test class is marked with the IgnoreAttribute because that approach is doomed. I actually started by writing that test case as an executable demonstration of the motivation for the less obvious technique.
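
For reference, the naive override looks something like this (a sketch of the shape, not the ignored test verbatim):

protected override IEnumerable<IEnumerable<int>> getPages() {
    const int pageSize = 10;
    var offset = 0;
    IEnumerable<LocalItem> p;
    do {
        // the (firstResult, maxResults) window is raw integer
        // arithmetic: concurrent inserts shift later rows forward
        // (false duplicates) and deletes shift them back (omissions)
        p = LocalItem.SlicedFindAll(
            offset,
            pageSize,
            DetachedCriteria.For(typeof(LocalItem)).
                AddOrder(Order.Asc("itemNumber")));
        offset += pageSize;
        yield return
            Enumerable.Select(
                p,
                i => elementToInt(i));
    } while (Enumerable.Count(p) == pageSize);
}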

Here's my second test design. I think on balance it's actually less readable than the first attempt. I wonder how many people would agree?


[Test]
public void insertionSansDuplicates() {
    insertion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(DuplicateDetectedException))]
public void insertionAnomalies() {
    insertion(fetchPageByOffset);
}

[Test]
public void deletionSansSkips() {
    deletion(fetchPageByValueEquality);
}

[Test]
[ExpectedException(typeof(AnomalyDetectedException))]
public void deletionAnomalies() {
    deletion(fetchPageByOffset);
}


Again, I wrote working test code both for the straightforward, broken use of primitives and for the approach that suits my needs.

Here's how I address the two kinds of trouble that can result from trying to use offsets for paging in the face of concurrent modification:


private class AnomalyDetectedException : Exception {}
private class DuplicateDetectedException : AnomalyDetectedException {}

private void insertion(FetchPage fetchPage) {
    var q = DetachedCriteria.For<Beer>();
    var O = new [] {"ABV", "BrewDate", "Variety"};
    Paging.AddOrder(q, O);
    var seenit = new HashSet<Beer>();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var page =
                fetchPage(
                    s,
                    q,
                    O,
                    null,
                    pageSize + 1);
            foreach (Beer i in Enumerable.TakeWhile(page, (elt, n) => n < pageSize)) {
                if (seenit.Contains(i)) throw new DuplicateDetectedException();
                seenit.Add(i);
            }
            var first = (Beer)page[0];
            var nouveau =
                new Beer {
                    Variety = "not" + first.Variety,
                    BrewDate = first.BrewDate,
                    ABV = first.ABV / 2};
            s.Save(nouveau);
            s.Flush();
            page =
                fetchPage(s, q, O, Enumerable.Last(page), pageSize);
            foreach (Beer i in page) {
                if (seenit.Contains(i)) throw new DuplicateDetectedException();
                seenit.Add(i);
            }
        }
    }
}

private void deletion(FetchPage fetchPage) {
    var q = DetachedCriteria.For<Beer>();
    var O = new [] {"ABV", "BrewDate", "Variety"};
    Paging.AddOrder(q, O);
    var expected = new HashSet<Object>();
    var actual = new HashSet<Object>();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var page = fetchPage(s, q, O, null, pageSize);
            foreach (var i in page) expected.Add(i);
            foreach (var i in page) actual.Add(i);
            foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
                expected.Add(i);
            }
            s.Delete(page[0]);
            s.Flush();
            foreach (var i in fetchPage(s, q, O, Enumerable.Last(page), pageSize)) {
                actual.Add(i);
            }

            actual.SymmetricExceptWith(expected);
            if (0 != actual.Count) throw new AnomalyDetectedException();
        }
    }
}


The test data are much less carefully prepared:


[TestFixtureSetUp]
public void setup() {
    Persistence.rebuildDB();
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            Assert.That(s.Linq<Bottle>().Count(), Is.GreaterThan(pageSize));
            Assert.That(s.Linq<Beer>().Count(), Is.GreaterThan(pageSize));
        }
    }
}


That rebuildDB method isn't particularly geared towards the paging problem. It's really intended for use with a broader class of NHibernate-related addons.

I also have boundary test cases:


[Test]
public void empty() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle));
            Assert.IsFalse(
                Enumerable.Any(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max + 1 },
                        pageSize)));
        }
    }
}

[Test]
public void singleton() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var max =
                Queryable.Max(
                    s.Linq<Bottle>(),
                    i => i.Serial);
            var source =
                DetachedCriteria.For(
                    typeof(Bottle)).
                    AddOrder(Order.Asc("Serial"));
            Assert.AreEqual(
                1,
                Enumerable.Count(
                    Paging.Page(
                        s,
                        source,
                        new [] {"Serial"},
                        new Bottle { Serial = max },
                        pageSize)));
        }
    }
}


When there's no concurrent data modification, the simpler, direct use of NHibernate primitives and the more complicated use of my paging abstraction produce equivalent results:


[Test]
public void naturals() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Bottle>().AddOrder(Order.Asc("Serial"))).ToArray();
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Bottle>(),
                Paging.Page<Bottle>(s, Q[1], new [] {"Serial"}, null, pageSize));
        }
    }
}

[Test]
public void beers() {
    using (var factory = Persistence.makeSessionFactory()) {
        using (var s = factory.OpenSession()) {
            var O = new [] {"ABV", "BrewDate", "Variety"};
            // ICriterion instances are mutable and don't
            // expose a copy constructor or clone method,
            // so we must make two criteria, just alike.
            var Q =
                Enumerable.Select(
                    Enumerable.Range(0, 2),
                    x => DetachedCriteria.For<Beer>()).ToArray();
            foreach (var q in Q) {
                Paging.AddOrder(q, O);
            }
            Assert.AreEqual(
                Q[0].GetExecutableCriteria(s).SetMaxResults(pageSize).List<Beer>(),
                Paging.Page(s, Q[1], O, null, pageSize));
        }
    }
}


I welcome your comments and criticism. Please see ormar for details or to publish your own code review.

vogon fodder

In this post, I show-and-tell about automated paper-pushing.

I spent considerable effort at work last week building something that is minimally useful. I am preoccupied by the conflict between the demands of professionalism and the idiotic/cynical mockery of professionalism that is usual (though not universal) in large organizations. But I've got to eat and I should focus on problems I can realistically solve. For now, let's accept the demands of management without question and focus on how to satisfy the demands at minimum cost to human life. In the words of Steve Jobs:

Well, let's say you can shave 10 seconds off of the boot time. Multiply that by five million users and that's 50 million seconds, every single day. Over a year, that's probably dozens of lifetimes. So if you make it boot ten seconds faster, you've saved a dozen lives. That's really worth it, don't you think?

Your manager emails you an MSWord document and asks you to fill it out once per software deployment. It should list every affected "object", indicate when (if ever) "User" has been notified and when and by whom (if ever) "BAM" has approved, and reference the other paperwork associated with the project. What do you do?

Clearly, the right answer involves CI and LaTeX. In my case: svn, ccnet, Bugzilla, python, xslt, ruby, LaTeX, perl, and pdf.

We've been using CI with SVN to do deployments since 2004. There's a branch for each deployment target, and a CI project associated with each of those branches. The CI builds usually correspond to a single commit, but we don't insist on that; when commits come fast enough we go for speed over precision in who's "on the hook" for breaking changes. So the first problem is to find the union of affected "objects" over the set of svn commits in the deployment.

The term "object" is taken from the argot of the RPGers, who are the oldest and largest group of software people at our company. For them, "object" often refers to a printed tabular summary of data, but more generally it can refer to a program for any purpose. The usual custom outside the RPG tribe is to list UI elements as "objects", but sometimes OO classes or source files are listed instead. It's not worth being too careful about this, because the real goal is pro forma compliance with a vague demand for documentation. Users basically want release notes, and engineers already get precise, accurate, and efficient answers to technical questions from svn and Bugzilla. The form really just needs to exist and, at a glance, to appear normal.

Our CI system, CruiseControl.NET, has a feature called the modification writer that renders metadata about a build to XML. This provides a list of added, deleted, and textually-modified files for each commit that's new since the last build:

<?xml version="1.0" encoding="utf-8"?>
<ArrayOfModification xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Modification>
    <Type>Modified</Type>
    <FileName>ccnet-self.config</FileName>
    <FolderName>/ccnet</FolderName>
    <ModifiedTime>2009-07-24T10:23:08.162158-05:00</ModifiedTime>
    <UserName>sean.foy</UserName>
    <ChangeNumber>4413</ChangeNumber>
    <Version />
    <Comment>whitespace
[Bug 50,51,3783]</Comment>
    <IssueUrl>https://bugzilla.thcg.net/buglist.cgi?bug_id=50,51,3783</IssueUrl>
  </Modification>
</ArrayOfModification>

We integrated SVN and Bugzilla in 2004, and so since that time we've been able to correlate commits with bugs. This gives us informal commentary about the motivations and the design considerations for each commit, plus metadata including who was involved in the process, how much time each of us spent, whether we paired or got code review, whether dbas got involved, who approved deployment, etc. Well, at least in principle. Actually, only one or two projects have ever really taken advantage of these capabilities. But the point is that this metadata fits the purported goal of our paperwork, and it's useful and easily accessible, and it exists for at least one project. It's the best we can do, and it's pretty good. So we've got to use it.

Python is a fairly pleasant language to work in and nobody's terrified by its syntax. It also has great libraries for screen-scraping. Accordingly, we used a python program to augment ccnet's modifications xml. It reads the ccnet xml and uses the bugtraq convention to write back the set of relevant bugs:

<ArrayOfModification>
  <Modification>
    <Type>Modified</Type>
    <FileName>ccnet-self.config</FileName>
    <FolderName>/ccnet</FolderName>
    <ModifiedTime>2009-07-24T10:23:08.162158-05:00</ModifiedTime>
    <UserName>sean.foy</UserName>
    <ChangeNumber>4413</ChangeNumber>
    <Version />
    <Comment>whitespace
[Bug 50,51,3783]</Comment>
    <IssueUrl>https://bugzilla.thcg.net/buglist.cgi?bug_id=50,51,3783</IssueUrl>
    <database />
    <flags>
      <flag id="1882" name="QCApproved" setter="michael.jorgensen@covidien.com" status="+" />
    </flags>
  </Modification>
</ArrayOfModification>
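
The augmentation itself is mostly regex work driven by the bugtraq-style comment convention, plus a Bugzilla round trip for flags like the one above. Our implementation is Python; sketched in C# with hypothetical names, the comment-parsing core would be:

// pull "[Bug 50,51,3783]"-style references out of a commit comment
static IEnumerable<int> bugIds(string comment) {
    return Regex.Matches(comment, @"\[Bug\s*([0-9, ]+)\]")
        .Cast<Match>()
        .SelectMany(m => m.Groups[1].Value.Split(','))
        .Select(s => int.Parse(s.Trim()));
}

// the buglist URL recorded in IssueUrl
static string issueUrl(IEnumerable<int> ids) {
    return "https://bugzilla.thcg.net/buglist.cgi?bug_id="
        + string.Join(",", Enumerable.ToArray(Enumerable.Select(ids, i => i.ToString())));
}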

Now we have all the answers we need to fill out our worksheet in XML. We just need to produce MSWord output (good luck). Well, MSWord probably isn't really necessary. That's what's expected, but also we should keep in mind that the goal is pro forma compliance. This document's real use is to ward off auditors, so it just needs to exist and not look unusual. So, we want to typeset XML? Obviously we should consider DocBook, XSL-FO, and LaTeX. I looked into XSL-FO c. 2001 and DocBook c. 2002, but I know LaTeX best. I wanted to spend less than 4 hours on this little project, so I was satisfied to use LaTeX.

For processing XML, you could do worse than XSLT. It can be particularly nice as a declarative template specification: "make my input look like this". So, I used xslt to process the bz-augmented ccnet modifications xml into LaTeX:

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0" xmlns:fn="http://www.w3.org/2005/xpath-functions" xmlns:smf="http://seanfoy.thcg.net/xslt">
<xsl:output method="text" />

<xsl:param name="asr">application service request number</xsl:param>
<xsl:param name="builddate">the date these changes became effective</xsl:param>
<xsl:param name="trunkdeploymenttarget">unc path</xsl:param>
<xsl:param name="trunkdbserver">hostname</xsl:param>
<xsl:param name="trunkdbname">dbname</xsl:param>
<xsl:param name="productiondeploymenttarget">unc path</xsl:param>
<xsl:param name="productiondbserver">hostname</xsl:param>
<xsl:param name="productiondbname">dbname</xsl:param>

<xsl:function name="smf:latexlit">
<xsl:param name="raw" />
<xsl:value-of select="fn:replace(fn:replace($raw, '([$&amp;%#_{}~^\\])', '\\$1'), '\\\\', '\$\\backslash\$')" />
</xsl:function>

<xsl:template match="@*|node()">
</xsl:template>
<xsl:template match="/">
<xsl:text>
\documentclass{article}

\usepackage{hyperref}
\usepackage{longtable}
\usepackage{latexsym}

\usepackage[document]{ragged2e}
\usepackage{wordlike}
\renewcommand{\familydefault}{phv} % Arial
\sloppy
\makeatletter
\renewcommand{\subsection}{\@startsection%
{subsection}%
{1}
{0em}
{-\baselineskip}%
{0.5\baselineskip}%
{\bfseries\LARGE\sffamily}}
\makeatother

\author{Sean Foy}
\title{Sample}

\begin{document}
\section*{Internet/Intranet Production Move Request}
\textbf{ASR /Track-it WO\#:} </xsl:text><xsl:value-of select="smf:latexlit($asr)" /><xsl:text>\hfill\textbf{Move Date:}

\textbf{Requestor/ User Notification Sent:} $\Box$ \textbf{Dated:}

\subsection*{Object(s): $\Box$Web \qquad $\Box$WebMethods (WM)}
\begin{longtable}{|l|l|}\hline
</xsl:text>
<xsl:for-each select="ArrayOfModification/Modification">
<xsl:value-of select="smf:latexlit(FileName)" />
<xsl:text>&amp; -r</xsl:text><xsl:value-of select="smf:latexlit(ChangeNumber)" />
<xsl:if test="fn:not(fn:empty(IssueUrl))">
<xsl:text> (</xsl:text><xsl:value-of select="smf:latexlit(IssueUrl)" /><xsl:text>)</xsl:text>
</xsl:if>
<xsl:text> \\</xsl:text>
</xsl:for-each>
<xsl:text>\hline
\end{longtable}

\subsection*{DB Object(s):}
\begin{longtable}{|l|l|}\hline
</xsl:text>
<xsl:for-each select="ArrayOfModification/Modification[database]">
<xsl:value-of select="smf:latexlit(FileName)" />
<xsl:text>&amp; -r</xsl:text><xsl:value-of select="smf:latexlit(ChangeNumber)" />
<xsl:if test="fn:not(fn:empty(IssueUrl))">
<xsl:text> (</xsl:text><xsl:value-of select="smf:latexlit(IssueUrl)" /><xsl:text>)</xsl:text>
</xsl:if>
<xsl:text> \\</xsl:text>
</xsl:for-each>
<xsl:text>
\hline
\end{longtable}

\subsection*{Deployment Servers:}
\begin{tabular}{|l|l|l|}\hline
&amp;
Development&amp;
Production\\\hline
Web:&amp;
\textbf{UNC:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdeploymenttarget)" /><xsl:text>&amp;
\textbf{UNC:} </xsl:text><xsl:value-of select="smf:latexlit($productiondeploymenttarget)" /><xsl:text>\\\hline
DB:&amp;
\textbf{Server:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdbserver)" /><xsl:text> \textbf{Database:} </xsl:text><xsl:value-of select="smf:latexlit($trunkdbname)" /><xsl:text>&amp;
\textbf{Server:} </xsl:text><xsl:value-of select="smf:latexlit($productiondbserver)" /><xsl:text> \textbf{Database:} </xsl:text><xsl:value-of select="smf:latexlit($productiondbname)" /><xsl:text>\\\hline
WM:&amp;
\textbf{UNC:} &amp;
\textbf{UNC:} \\\hline
\end{tabular}

\rule{\textwidth}{0.25em}

\textbf{Move Approved by BAM}:

Signature: \underline{\makebox[2in][l]{}}

Date: \underline{\makebox[1in][l]{}}

\end{document}
</xsl:text>
</xsl:template>
</xsl:stylesheet>

LaTeX is based on Don Knuth's TeX typesetting system. It's serious business: Turing-complete and designed specifically for the needs of the world's most demanding typographical experts.

For a while I thought about merging the bug and svn data here in LaTeX. My TeX-hacking skills are very weak now, so I had in mind the python LaTeX package. As I escaped for XSL the escape for LaTeX of the escape for Python of the LaTeX output I wanted, I thought I was probably making a mistake. LaTeX is great for typesetting, and capable of general computation, but it's awkward to express the heavy lifting there (even in an embedded language). The python package is probably better used tactically, for simple comprehensions and the like -- not for elaborating significantly on the structure of the LaTeX markup.

We must compile LaTeX to obtain our output (normally PostScript or PDF). I use a perl script, latexmk, to manage the multi-phase compilation process and to clean up the intermediate products.

So we want to combine ((((xml and Bugzilla) by python to xml) and xsl) by xslt to LaTeX) by latexmk to PDF. We need to process our inputs and intermediate products in topological order; this is a build process. I've used make before, but it's a bit crufty. I prefer NAnt, whose XML syntax is more explicit and discoverable. From what I've seen of rake, I like it better than NAnt. NAnt's a priori boundary between its implementation and your build specification is confining. Rake draws no distinction between itself and your build specification.

# -*- mode: ruby -*-
# Render modifications as beautiful
# (or hideous!) PDF
#
# sample invocation:
#   rake latexmk=vendor/latexmk/latexmk.pl modifications.pdf

require 'rake/clean'
require 'rexml/document'
require 'rexml/xpath'

transform = "#{ENV['transform'] || 'vendor/saxonb9-1-0-7n/bin/Transform.exe'}"
latexmk = "#{ENV['latexmk'] || 'vendor/latexmk/latexmk.pl'}"

modifications_xml = "#{ENV['modifications_xml'] || '../Artifacts/modifications.xml'}"

task :default => 'modifications.pdf'

task :clean => ['clobberhelper'] do
end

task :clobber => ['clobberhelper'] do
end

def contains_clobber(ts)
  ts.each { |i|
    return true if (i == 'clobber') || contains_clobber(Rake::Task[i].prerequisites)
  }
  return false
end

task :clobberhelper do |t|
  if contains_clobber(Rake::application().top_level())
    sh latexmk, '-C'
  else
    sh latexmk, '-c'
  end
end

file 'modifications.tex' => [modifications_xml, 'modifications.xsl'] do |t|
  xsl_params = {}
  begin
    f = File.new('modifications.xsl')
    doc = REXML::Document.new(f)
    REXML::XPath.each(doc.root, 'xsl:param', {'xsl' => 'http://www.w3.org/1999/XSL/Transform'}) do |elt|
      xsl_params[elt.attributes['name']] = ENV[elt.attributes['name']] || elt.text
    end
  ensure
    f.close unless f.nil?
  end
  xsl_args = xsl_params.inject([]){ |accum, (key, value)|
    accum << "#{key}=#{value}"
  }
  sh transform, "-s:#{modifications_xml}", '-xsl:modifications.xsl', '-o:modifications.tex', *xsl_args
end
CLEAN.include('modifications.tex')

file 'modifications.pdf' => ['modifications.tex'] do |t|
  sh latexmk, '-pdf', 'modifications.tex'
end

If you don't know about rake, one point of confusion is that rake clean conventionally means "remove intermediates" while rake clobber means "remove intermediate and final outputs".

With my use of latexmk, I encountered a technical problem in the implementation of the standard clean and clobber rake tasks: latexmk needs the .tex file to decide what to clean up, but rake wants to clean (removing the .tex) before it clobbers. I needed to let latexmk do its job before clean, and latexmk needs to know whether to clean up the PDF or just the intermediates. My solution was to add a dependency clobberhelper for the clean task to call latexmk's cleanup routine. clobberhelper searches rake's dependency graph for clobber in order to decide how to call latexmk.

Finally, we have a corporate commitment to .NET and so most of our projects benefit from NAnt's understanding of .NET. We want an easy way to use this motley assortment of tools with our existing NAnt build files:

<?xml version="1.0" encoding="utf-8"?>
<project name="sdim" default="move-worksheet">
  <property name="artifacts_dir" value="." overwrite="false" />
  <!-- nant for Windows doesn't use the cygwin path so
       you need to chdir to %cygwin_home%\bin and then
       run %cygwin_home%\path\to\python or
       install Windows python and do e.g.,
       C:\Python31\python.exe -->
  <property name="python" value="C:\Python31\python.exe" overwrite="false" />
  <property name="rake" value="C:\Ruby\bin\rake.bat" overwrite="false" />
  <property name="modifications_xml" value="modifications.xml" overwrite="false" />

  <target name="move-worksheet">
    <exec program="${python}">
      <arg line="bzaugment.py ${path::combine(artifacts_dir, modifications_xml)} -o ${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
    </exec>
    <exec program="${rake}">
      <arg line="modifications_xml=${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
    </exec>
  </target>

  <target name="clean">
    <exec program="${rake}">
      <arg line="clobber" />
    </exec>
    <delete file="${path::combine(artifacts_dir, 'modifications.augmented.xml')}" />
  </target>
</project>

I still have some reservations about the morality of this. On the one hand, no human should waste even 30 minutes of precious life preparing ambiguous, incomplete, inaccessible and uninsightful summaries of data that's already available, accurate, complete, and easily accessible. On the other hand, you can lead a manager to water but you can't make him drink. This solution seems to balance those two forces, but by reducing the cost of useless redundancy, am I encouraging further bureaucratic malignancy?

Voilà:


Anyway, I hope you enjoyed the tour of this motley assemblage of languages and tools.


Thanks to Josh Buedel for comments and suggestions during the preparation of this post.

23.1: new .emacs


(setq ns-command-modifier 'meta)
(if (functionp 'tool-bar-mode)
    (tool-bar-mode -1))

(setq default-buffer-file-coding-system 'utf-8-unix)
(setq-default indent-tabs-mode nil)
(put 'erase-buffer 'disabled nil)

(cond
 ((file-exists-p "/sw/bin")
  ;; MacOS X + fink
  (setenv "PATH" (concat (getenv "PATH") ":/sw/bin"))
  (setq exec-path (append exec-path '("/sw/bin"))))
 ((file-exists-p "c:/")
  ;; Windows
  (setenv "PATH" (concat (getenv "PATH") ";c:\\cygwin-1.7\\bin"))
  (setq exec-path (append exec-path '("c:\\cygwin-1.7\\bin")))
  (if (require 'cygwin-mount nil t)
      (progn
        (cygwin-mount-activate)
        (setq shell-file-name "bash")
        (setenv "SHELL" shell-file-name)
        (setq w32-quote-process-args ?\")
        ;; for subprocesses invoked via the shell
        ;; (e.g., "shell -c command")
        (setq explicit-shell-file-name shell-file-name)
        (setq explicit-sh-args '("-login" "-i"))))
  ;; (if I'm using Windows, I must be at work!)
  (setq bug-reference-url-format
        "https://bugzilla.thcg.net/show_bug.cgi?id=%s")
  (setq bug-reference-bug-regexp
        "\\(?:[Bb]ug ?#?\\|PR [a-z-+]+/\\)\\([0-9]+\\)")))

(dolist (mode '(python-mode java-mode ruby-mode js2-mode nxml-mode))
  (add-hook
   (intern (concat (symbol-name mode) "-hook"))
   (lambda ()
     (setq indent-tabs-mode nil))))

(autoload 'python-mode "python-mode" "Python editing mode." t)
(add-to-list 'auto-mode-alist '("\\.py$" . python-mode))
(add-to-list 'interpreter-mode-alist '("python" . python-mode))
(add-hook 'python-mode-hook
          (lambda ()
            (if (null (getenv "LC_CTYPE"))
                (setenv "LC_CTYPE" "C"))))

(add-to-list 'auto-mode-alist '("\\.cs$" . java-mode))

(autoload 'js2-mode "js2" nil t)
(add-to-list 'auto-mode-alist '("\\.js$" . js2-mode))

(autoload 'ruby-mode "ruby-mode" "Ruby editing mode." t)
(add-to-list 'auto-mode-alist '("\\.rb$" . ruby-mode))

(setq auto-mode-alist
      (cons '("\\.\\(xml\\|xsl\\|rng\\|xhtml\\)\\'" . nxml-mode)
            auto-mode-alist))
(add-to-list 'magic-mode-alist '("<\\?xml " . nxml-mode))

(if (load "auctex.el" t t t)
    (load "preview-latex.el" nil t t))
(setq ispell-program-name "aspell")
(setq ispell-extra-args '("--sug-mode=ultra"))