Sunday, October 11, 2009

Version numbers and (D)VCS

I've spent a lot of time this weekend trying to adapt apply-version.nant, originally written for oopsnet and svn, to ormar and mercurial. It wasn't so easy to find guidance from others with similar goals, so hopefully this post makes the information more accessible.


The primary purpose of software version numbers is to identify the code. When we distribute to users, the version number gives them a concise reference for use in communicating about bugs and available features.

A secondary purpose is to define a temporal order on distributions. Versions are normally numeric and monotonically increasing; higher versions are newer and hopefully better than older versions.

Conventionally, there is at least some human involvement in assigning version numbers. In proprietary software, sales concerns usually play a part. The technical issue of compatibility is often difficult to treat formally, so human judgment is used to encode compatibility information in version numbers.

But it's also conventional to let the low-order bits of a version number be determined automatically. Whenever build inputs change, the behavior of the software may change, but nobody wants to be bothered incrementing version numbers for routine small changes.

Besides version numbers, it's common to hear of "build numbers." Sometimes the terms are used interchangeably, but I think it's useful to distinguish between (a) an identifier for the build input and (b) an identifier for the build event. Some people use (b) as a more convenient proxy for (a), and some people apparently really care about (b) itself, although I'm not sure why. Maybe it's because on some teams deployment is part of the build process*, and it's nice to have a formal record of deployments.

Theory and practice

I've used Subversion for nearly my whole career so far. It's a centralized version control system, and sensibly enough it identifies historical events with natural numbers; revision 0 is repository creation. So svn revision numbers are a very convenient basis for version numbers. Just check that the working copy (wc) is consistent with a revision from the repository and use that revision number for the low-order version bits. For svn, consistency between wc and repository is a question of (a) uncommitted changes and (b) files or directories within the wc at different revisions.
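Here's a minimal sketch of that consistency check. It assumes the `svnversion` command-line tool, which prints plain digits for a consistent working copy, a colon-separated range for mixed revisions, and trailing letters such as `M` for local modifications. The function names are hypothetical, not taken from apply-version.nant.

```python
import re
import subprocess


def clean_svn_revision(svnversion_output):
    """Return the revision number if the working copy is consistent, else None."""
    text = svnversion_output.strip()
    # Plain digits mean one revision and no local changes. A colon
    # ("4123:4168") means mixed revisions; trailing letters ("4168M")
    # flag states such as local modifications.
    if re.fullmatch(r"[0-9]+", text):
        return int(text)
    return None


def working_copy_revision(path="."):
    """Run svnversion on a working copy (assumes svnversion is on PATH)."""
    result = subprocess.run(
        ["svnversion", path], capture_output=True, text=True, check=True
    )
    return clean_svn_revision(result.stdout)
```

A build script could refuse to stamp a version whenever `working_copy_revision()` returns `None`.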

This is a bit harder to do with some other centralized version control systems. In SCCS, revision numbers (SIDs) apply not to repositories but to individual files. Microsoft TFS has svn-style revision numbers that they call "changeset numbers," but their implementation choices and tools make it difficult and expensive to answer the wc-repository consistency question. But fundamentally, in a cvcs, there's a global clock, and that can serve as a basis for version numbering. In every cvcs I've seen, it's practical to use it that way, although it might be easier in some cases (svn) and harder in others (TFS).

For distributed version control systems, we have no global clock. Fundamentally, events are temporally ordered only by causal relationships, so you can really only establish the primary property for version numbers: identifying build inputs. There's no general way to establish the secondary property that allows users to compare version numbers and decide which is newer. And yet, the mercurial project itself produces monotonic version numbers! How do they do it? Apparently by manually tagging.
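To make the "ordered only by causal relationships" point concrete, here's a small sketch that models history as a map from each changeset to its parents. Two changesets are comparable only when one is an ancestor of the other. The history and the names `alice` and `bob` are made up for illustration.

```python
def ancestors(parents, node):
    """All changesets reachable from node by following parent links."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare(parents, a, b):
    """Return -1 if a precedes b, 1 if b precedes a, 0 if equal, None if unordered."""
    if a == b:
        return 0
    if a in ancestors(parents, b):
        return -1
    if b in ancestors(parents, a):
        return 1
    return None  # neither caused the other: no objective "newer"


# Hypothetical history: two clones each commit on top of "base".
history = {"base": [], "alice": ["base"], "bob": ["base"]}
```

Here `compare(history, "base", "alice")` gives `-1`, but `compare(history, "alice", "bob")` gives `None`, which is exactly the case a monotonic version number can't express.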

How important is temporal ordering really? Certainly the most important thing is the capability for repeatable builds. Some DVCS projects have concise identifiers for build inputs; in hg and git we have hashcodes. Unfortunately for those of us on the .NET platform, Microsoft provides only 4 * 16 bits of space for version numbers in their assembly metadata specification. This isn't nearly enough for hg's 160-bit changeset ids (though it could accommodate the potentially ambiguous 48-bit short form), especially if we want to use one or more of those four fields for encoding compatibility data.
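The arithmetic is easy to check: a 160-bit id would need ten 16-bit fields, while the 12-hex-digit short form needs three, leaving a single field for anything else. Here's a rough sketch of that packing; the function names are hypothetical.

```python
import re


def pack_short_id(short_id):
    """Split a 12-hex-digit (48-bit) changeset prefix into three 16-bit fields."""
    if not re.fullmatch(r"[0-9a-f]{12}", short_id):
        raise ValueError("expected 12 lowercase hex digits")
    value = int(short_id, 16)
    return ((value >> 32) & 0xFFFF, (value >> 16) & 0xFFFF, value & 0xFFFF)


def unpack_short_id(fields):
    """Rebuild the 12-hex-digit prefix from three 16-bit fields."""
    high, middle, low = fields
    return "%04x%04x%04x" % (high, middle, low)
```

The result identifies the build input but says nothing about order, and tooling may reject some field values, so treat this as an illustration of the space constraint rather than a recommended scheme.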

A common special case

There's a very common special case of projects using dvcs for which we can establish an objective order on versions. There's often an official repository, whose event history meets our need.

Well that's fine in theory, but is it practical? Unfortunately for me, hg doesn't allow remote queries of a given repository's local timestamps ("revision numbers"). I hope that's due to an efficiency trade-off and not just a pedantic effort ("these revision numbers aren't guaranteed to match those in your clone; use changeset ids instead!").

The good news is that in hg, revision-number consistency is preserved under clone and pull operations. If you commit in your repository, you may irreconcilably lose consistency, but as long as you abstain from making changes, you and the other repo will agree on the bijective function between revision numbers and changeset ids. So my plan for .NET assembly versioning in my googlecode hg repositories is to use a pristine clone for official builds and at least one separate clone for feature development and bug fixes.
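A sketch of how an official build might read that number from the pristine clone, assuming `hg identify --num`, which prints the local revision number of the working directory's parent and appends `+` when there are uncommitted changes. The function names and the 16-bit range check are my own illustration.

```python
import re
import subprocess


def parse_hg_revision(output):
    """Turn `hg identify --num` output into a revision number, or raise ValueError."""
    text = output.strip()
    # Anything other than plain digits (for example a trailing "+",
    # which marks uncommitted changes) is rejected.
    if not re.fullmatch(r"[0-9]+", text):
        raise ValueError("not a single clean revision: %r" % text)
    number = int(text)
    if number > 0xFFFF:
        raise ValueError("revision %d does not fit in a 16-bit field" % number)
    return number


def official_build_revision(repo_path):
    """Ask hg for the revision number of a clone (assumes hg is on PATH)."""
    result = subprocess.run(
        ["hg", "identify", "--num", "-R", repo_path],
        capture_output=True, text=True, check=True,
    )
    return parse_hg_revision(result.stdout)
```

This only tells you the clone is clean, not that it was never committed to; keeping the clone pristine is still a matter of discipline.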

*For the IT web apps I've worked on, we had automated deployment as part of our CI routine, but we were satisfied to have an svn branch per deployment target. Actually, we had one svn branch per deployment target equivalence class representative. Really we had a small number of user communities (e.g., end-users, beta-testers, trainees, programmers), and we had a branch for each of them and the server clusters for each.
