Wednesday, March 17, 2010

Versioning Bags

I've been working with BagIt recently.  BagIt is a specification that gives a directory of files some additional semantics, including a manifest with checksums and some minimal metadata.  I like the specification a lot, and I have recently become interested in being able to version a bag ("bag" is what you call a directory that conforms to BagIt).  CDL (the California Digital Library) has another specification called ReDD that I don't like nearly as much as BagIt, but some ideas from that spec are just too good to throw out.

BagIt puts the contents of the directory into a sub-directory called "data".  In the mockup below, there is a directory at the same level as the data directory called "reverse-deltas".  Inside it are directories, each named with a timestamp.  Each of these directories is a valid bag in its own right (important so that the tools for passing around bags can be used to pass around reverse deltas, and to keep track of the fixity of the reverse deltas).  Each one can be used to move the bag backward in time: delete the files listed in delete.txt, then add all the files inside the add/ directory, if it is present:

|-bag-info.txt
|-manifest-md5.txt
|-data/
  |-file1.txt
  |-file2.txt
  |-file3.txt
|-reverse-deltas/
  |-2009-10-21-23-59-59/
    |-bag-info.txt
    |-manifest-md5.txt
    |-data/
      |-delete.txt          # files to be deleted
      |-add/                # files to be added
        |-file1.txt
        |-file2.txt
  |-2009-10-10-12-01-02/
    |-bag-info.txt
    |-manifest-md5.txt
    |-data/
      |-delete.txt
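
Applying a single reverse delta could look something like the Python sketch below.  It is only an illustration of the delete-then-add step, not part of BagIt or ReDD.  The function name apply_reverse_delta is hypothetical, and the sketch assumes that delete.txt holds one path per line, relative to the main bag's data/ directory, and that files under add/ use the same relative layout.  The mockup does not pin down either detail.

```python
import shutil
from pathlib import Path


def apply_reverse_delta(data_dir: Path, delta_dir: Path) -> None:
    """Move data_dir back in time using one reverse-delta bag (hypothetical helper)."""
    payload = delta_dir / "data"

    # Step 1: remove every file listed in delete.txt.
    # Assumption: one path per line, relative to the main bag's data/ directory.
    delete_list = payload / "delete.txt"
    if delete_list.is_file():
        for line in delete_list.read_text().splitlines():
            name = line.strip()
            if name:
                (data_dir / name).unlink(missing_ok=True)

    # Step 2: copy everything under add/ (if it is present) into data/.
    # Assumption: add/ mirrors the layout of the main bag's data/ directory.
    add_dir = payload / "add"
    if add_dir.is_dir():
        for src in add_dir.rglob("*"):
            if src.is_file():
                dest = data_dir / src.relative_to(add_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
```

A call such as apply_reverse_delta(Path("bag/data"), Path("bag/reverse-deltas/2009-10-21-23-59-59")) would change data/ in place, so it is probably safer to run it against a copy of the bag.  The sketch does not regenerate manifest-md5.txt, so the result is not a valid bag until the manifest is rebuilt.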

For the toolchain, I think this presents some interesting possibilities.  The most recent version of a bag is always in the data/ directory, and can be grabbed with any BagIt-compliant tool.  BagIt just ignores the other directories at the top level.

However, if one were going to build a service on top of a versioned bag, it would be fairly straightforward to provide an identifier scheme that would get back to any previous version of a bag (e.g. http://davidbrunton.com/bag/1?date=2009-10-21).  One could even assume an earliest reverse delta that specifies deletion of all the files, so that a request for a date before the bag had any content could yield a 404.
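
As a rough sketch of how such a service might map a date to a version, the Python below picks out which reverse deltas would need to be applied.  The function name deltas_to_undo is hypothetical.  It also rests on an assumption the mockup leaves open: that a delta's timestamp records when the change it reverses was made, so seeing the bag as of a given date means undoing every delta stamped later than that date, newest first.

```python
from datetime import datetime
from pathlib import Path

# Directory-name format used in the mockup, e.g. 2009-10-21-23-59-59.
STAMP = "%Y-%m-%d-%H-%M-%S"


def deltas_to_undo(bag_dir: Path, as_of: datetime) -> list[Path]:
    """Reverse deltas to apply, newest first, to see the bag as of `as_of`.

    Assumption: each delta is stamped with the time of the change it reverses.
    """
    root = bag_dir / "reverse-deltas"
    if not root.is_dir():
        return []  # an unversioned bag: only the current data/ exists

    stamped = []
    for entry in root.iterdir():
        if not entry.is_dir():
            continue
        try:
            stamped.append((datetime.strptime(entry.name, STAMP), entry))
        except ValueError:
            continue  # skip directories that are not named with a timestamp

    # Newest first, keeping only the changes made after the requested date.
    stamped.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for when, entry in stamped if when > as_of]
```

With the mockup above, asking for datetime(2009, 10, 21) would select only the 2009-10-21-23-59-59 delta.  Asking for a date before 2009-10-10 would select both, and since the oldest delta contains only delete.txt, the reconstructed data/ would presumably end up empty, which is the case a service could answer with a 404.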

I'm curious to hear what use cases this meets and which it fails to meet.  Anyone?

Friday, March 12, 2010

This Note Has No Title

A computer is a machine made only of switches.  We forget this.  We think in computational metaphors: functions, procedures, objects, monads, functors, generators, routines.  Even low-level ints, bools, chars, and floats obscure the facts from us.  A computer is a machine that knows nothing of language or metaphor: every operation is nothing more than the opening or closing of an electric circuit.

An initial useful abstraction beyond this concrete reality is that some of the switches are controlled indirectly.  They are electrical circuits that can only be opened or closed by the computer itself; they cannot be manipulated directly.  We call these circuits the computer's memory.

A second useful abstraction of our switch-machine is that we can store the process for using it.  Humans have been doing this with machines for at least several hundred years, but around sixty years ago, we began to use the machine itself to store this information.

Thursday, January 21, 2010

Notes on Pragmatic Language Design

If you are Ed, and you are reading this post: no, it isn't the post I promised (yet).  But it is a precursor, and you should read it before you read the subsequent one in the series.  Not that you have any choice, since I haven't actually written the next post yet.

I had a nice conversation with a coworker yesterday, in which she asked me a question I hear a lot in library-land: "Is it an identifier for the abstract thing, or just for this manifestation of the thing?"  I reminded her that I don't understand what "abstract thing" means in that context.

Contingent on the acceptance of "thing" as a valid concept, an identifier either identifies that thing or does not.  We can (and at libraries, often do) argue about whether a particular identifier goes with a particular thing, but this is a specific argument rather than an abstract one.  It is an argument about language design for some particular identifier.

If I substitute "word" for "identifier" it brings the problem with my way of thinking into stark relief.  A word either identifies a thing, or it does not.  If my coworker and I agree about the association between the word and the thing, there's no problem.  But if we disagree, we are not using the same word (even though it is spelled the same and pronounced the same).  We are faced with a multiple dispatch scenario: if she understands what I mean by my word and I understand what she means by her word, we need some way to disambiguate which one we are using when we speak.  We sometimes do that by adding other words: pen-in-the-sense-I-mean-it versus pen-in-the-sense-she-means-it.  If we often need to perform this kind of disambiguation, we probably develop a lingo or some jargon that people outside our small subgroup might not immediately grok.

In this way, we are pragmatic language designers.  We are using words to communicate about things, and a word is an identifier for all the things it identifies.  This is a definition, in the words of my brother Daniel, that probably "dissolves into mush" if examined too closely.

But I'm convinced it's right, despite its fragility.  I think it's even more right in library-land.  When we use identifiers, we need clear criteria for what they do-and-do-not identify.  When we get close to the edge of the definition, we should discuss whether a particular thing is in or out rather than trying to speak in abstracts.  And when it becomes evident we need to disambiguate the-thing-that-I-mean from the-thing-that-you-mean, we should carefully consider adding some words (identifiers) to help us with that task.  If we do it repeatedly, we should design them into our language.

And every once in a great while, we should go over the whole language and see if it could benefit from a little refactoring.  See if there are similarities in the places where jargon and lingo are cropping up, see if we can't make it into something that's easy for us all to remember.
