Wednesday, March 17, 2010
Versioning Bags
I've been doing some work recently with BagIt. BagIt is a specification that gives a directory of files some additional semantics, including a manifest with checksums and some minimal metadata. I like the specification a lot. I have recently become interested in being able to version a bag ("bag" is what you call a directory that conforms to BagIt). CDL has another specification called ReDD that I don't like nearly as much as I like BagIt, but there were some ideas from that spec that are just too good to throw out.
BagIt puts the contents of the directory into a sub-directory called "data". In the mockup below, there is a directory called "reverse-deltas" at the same level as the data directory. Inside it are directories, each named with a timestamp. Each of these directories is a valid bag in its own right. That matters because the tools for passing around bags can then be used to pass around reverse deltas, and to keep track of their fixity. Each reverse delta can be used to move the bag one step backward in time, by deleting the files listed in delete.txt and adding all the files inside the add/ directory, if it is present:
|-bag-info.txt
|-manifest-md5.txt
|-data/
   |-file1.txt
   |-file2.txt
   |-file3.txt
|-reverse-deltas/
   |-2009-10-21-23-59-59
      |-bag-info.txt
      |-manifest-md5.txt
      |-data/
         |-delete.txt   # files to be deleted
         |-add/         # files to be added
            |-file1.txt
            |-file2.txt
   |-2009-10-10-12-01-02
      |-bag-info.txt
      |-manifest-md5.txt
      |-data/
         |-delete.txt
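To make the "move the bag backward" step concrete, here is a minimal sketch in Python. It rests on assumptions the mockup does not spell out: that delete.txt holds one path per line, relative to the bag's data/ directory, and that add/ mirrors the layout of data/. The function name apply_reverse_delta is made up for illustration and is not part of BagIt or ReDD.

```python
import shutil
from pathlib import Path


def apply_reverse_delta(bag_dir, delta_dir):
    """Move bag_dir/data one step back in time using one reverse delta."""
    data = Path(bag_dir) / "data"
    payload = Path(delta_dir) / "data"

    # Assumption: delete.txt lists one path per line, relative to data/.
    delete_list = payload / "delete.txt"
    if delete_list.is_file():
        for line in delete_list.read_text().splitlines():
            name = line.strip()
            if name:
                target = data / name
                if target.is_file():
                    target.unlink()

    # Assumption: add/ mirrors the layout of data/, so each file is
    # copied to the same relative path inside the bag's payload.
    add_dir = payload / "add"
    if add_dir.is_dir():
        for src in add_dir.rglob("*"):
            if src.is_file():
                dest = data / src.relative_to(add_dir)
                dest.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dest)
```

This sketch deletes first and adds second, so a file that appears in both delete.txt and add/ ends up restored. It does not touch the top-level manifest-md5.txt, which would need to be regenerated before the rolled-back bag validates again.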
In terms of the tool-chain, I think this presents some interesting possibilities. The most recent version of a bag is always in the data/ directory, and can be grabbed with any BagIt-compliant tool. BagIt tools just ignore the other directories at the top level.
However, if one were going to build a service on top of a versioned bag, it would be fairly straightforward to provide an identifier scheme that would get back to any previous version of a bag (e.g. http://davidbrunton.com/bag/1?date=2009-10-21). One could even assume an earliest version that specifies deletion of all the files, possibly yielding a 404.
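A service like that would need to decide which reverse deltas to apply for a requested date. Here is one way it might look, as a sketch only. It assumes that a delta's timestamp records when the change it undoes was made, so reaching a given moment means applying every delta stamped later than that moment, newest first. The mockup does not pin down that meaning, and the name deltas_to_apply is hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Assumed from directory names like 2009-10-21-23-59-59 in the mockup.
STAMP = "%Y-%m-%d-%H-%M-%S"


def deltas_to_apply(reverse_deltas_dir, target):
    """Return the delta directories to apply, newest first, to reach target."""
    stamped = []
    for child in Path(reverse_deltas_dir).iterdir():
        if not child.is_dir():
            continue
        try:
            stamp = datetime.strptime(child.name, STAMP)
        except ValueError:
            continue  # skip anything not named as a timestamp
        # Assumption: a delta undoes a change made at its timestamp, so
        # only deltas stamped after the target moment are needed.
        if stamp > target:
            stamped.append((stamp, child))
    stamped.sort(key=lambda pair: pair[0], reverse=True)
    return [child for _, child in stamped]
```

With the two deltas in the mockup, a target of 2009-10-15 would select only the 2009-10-21-23-59-59 delta, and any target before 2009-10-10-12-01-02 would select both, ending with the one that deletes everything.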
I'm curious to hear what use cases this meets and which it fails to meet. Anyone?