Wednesday, March 17, 2010

Versioning Bags

I've been doing some work recently with BagIt.  BagIt is a specification that gives a directory of files some additional semantics, including a manifest with checksums and some minimal metadata.  I like the specification a lot.  Lately I have become interested in versioning a bag (a "bag" is what you call a directory that conforms to BagIt).  CDL has another specification called ReDD that I don't like nearly as much as I like BagIt, but some of the ideas in that spec are just too good to throw out.

BagIt puts the contents of the directory into a sub-directory called "data".  In the mockup below, there is a directory called "reverse-deltas" at the same level as the data directory.  Inside it are directories, each named with a timestamp.  Each of these directories is a valid bag in its own right (important so that the tools for passing around bags can also pass around reverse deltas and keep track of their fixities).  Each reverse delta can be used to move the bag backward in time: delete the files listed in delete.txt, then add all the files inside the add/ directory, if it is present:

|-bag-info.txt
|-manifest-md5.txt
|-data/
  |-file1.txt
  |-file2.txt
  |-file3.txt
|-reverse-deltas/
  |-2009-10-21-23-59-59
    |-bag-info.txt
    |-manifest-md5.txt
    |-data/
      |-delete.txt          # files to be deleted
      |-add/                # files to be added
        |-file1.txt
        |-file2.txt
  |-2009-10-10-12-01-02
    |-bag-info.txt
    |-manifest-md5.txt
    |-data/
      |-delete.txt
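
A rough sketch of applying one reverse delta to the layout above might look like the following. The post does not define the format of delete.txt, so this assumes one path per line, relative to the bag's data/ directory, and that add/ mirrors the layout of data/. The function name apply_reverse_delta is made up for illustration, and the sketch does not regenerate manifest-md5.txt afterward, which a real tool would need to do.

```python
import shutil
from pathlib import Path


def apply_reverse_delta(bag_dir, delta_dir):
    """Move a bag one step back in time using one reverse delta (hypothetical helper)."""
    data = Path(bag_dir) / "data"
    payload = Path(delta_dir) / "data"

    # Step 1: delete every file listed in delete.txt.
    # Assumption: one path per line, relative to the bag's data/ directory.
    delete_list = payload / "delete.txt"
    if delete_list.exists():
        for line in delete_list.read_text().splitlines():
            name = line.strip()
            if name:
                (data / name).unlink(missing_ok=True)  # needs Python 3.8+

    # Step 2: copy everything under add/ into data/, if add/ is present.
    # Assumption: add/ mirrors the directory layout of data/.
    add_dir = payload / "add"
    if add_dir.is_dir():
        for src in add_dir.rglob("*"):
            if src.is_file():
                dest = data / src.relative_to(add_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
```

Calling apply_reverse_delta("mybag", "mybag/reverse-deltas/2009-10-21-23-59-59") would modify mybag/data in place, so in practice one would probably work on a copy of the bag.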

In terms of the tool-chain, I think this presents some interesting possibilities.  The most recent version of a bag is always in the data/ directory, and can be grabbed with any BagIt-compliant tool.  BagIt just ignores the other directories at the top level.

However, if one were going to build a service on top of a versioned bag, it would be fairly straightforward to provide an identifier scheme that would get back to any previous version of a bag (e.g. http://davidbrunton.com/bag/1?date=2009-10-21).  One could even assume an earliest version that specifies deletion of all the files, possibly yielding a 404.
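
As a sketch of how such a service might decide which reverse deltas to apply for a requested date: the post does not say exactly what a delta's timestamp means, so the code below assumes it marks the moment at which the restored state stopped being current. Under that assumption, every delta stamped at or after the requested time has to be applied, newest first. The names deltas_to_apply and STAMP_FORMAT are hypothetical.

```python
from datetime import datetime

# Matches directory names like 2009-10-21-23-59-59.
STAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"


def deltas_to_apply(delta_names, requested):
    """Return the reverse-delta names to apply, newest first, to reach `requested`."""
    stamped = [(datetime.strptime(name, STAMP_FORMAT), name) for name in delta_names]
    # Assumption: a delta's timestamp is when its restored state stopped
    # being current, so every delta at or after the requested time is needed.
    needed = [pair for pair in stamped if pair[0] >= requested]
    return [name for _, name in sorted(needed, reverse=True)]
```

With the two deltas in the mockup, a request for 2009-10-21 would select only 2009-10-21-23-59-59, while a request for any date before 2009-10-10 would select both. The last of those contains only delete.txt, which fits the idea of an earliest version with no files at all.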

I'm curious to hear what use cases this meets and which it fails to meet.  Anyone?

Friday, March 12, 2010

This Note Has No Title

A computer is a machine made only of switches.  We forget this.  We think in computational metaphors: functions, procedures, objects, monads, functors, generators, routines.  Even low-level ints, bools, chars, and floats obscure the facts from us.  A computer is a machine that knows nothing of language or metaphor: every operation is nothing more than the opening or closing of an electric circuit.

An initial useful abstraction beyond this concrete reality is that some of the switches are controlled indirectly.  They are electrical circuits that can only be opened or closed by the computer itself; they cannot be manipulated directly.  We call these circuits the computer's memory.

A second useful abstraction of our switch-machine is that we can store the process for using it.  Humans have been doing this with machines for at least several hundred years, but beginning around sixty years ago, we began to use the machine itself to store this information.
