by Trey Pendragon
Lately we’ve been working on preservation within Figgy and have come up with a strategy we think works for us. However, in doing so, we’ve evaluated a few options and come up with some assumptions which could be useful to others.
This isn’t our entire preservation story (there are other issues such as replication, management, audit trails, etc.), but it compares a few options for our back-end storage.
I talk about an “object” a lot below. After much discussion, we decided at Princeton that we want a “directory” to consist of a top-level object (like a Book) and all its children.
This makes it a requirement that I am able to copy a single directory and get everything I need to re-ingest that object and display it. External references may not come in, but the hierarchy will.
So a Volume will be nested underneath its Multi-Volume Work, and a Page will be nested underneath its Book. This is important, because if one decides against this then the pros/cons of the following cases may change significantly.
We want to store our files in a system which can do the following:
This list is not exhaustive, but it’s the options we evaluated.
Files can be stored in a hierarchy similar to the following:
- <resource-id>
  - data
    - <child-id>
      - <child-id>.json
      - <binary.tif>
  - <resource-id>.json
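As a rough illustration, a layout like the one above can be generated by walking the object tree. This is a minimal sketch, not Figgy's code: the Resource class, the storage_paths helper, and the sample IDs are hypothetical, and it assumes a resource's binaries sit beside its JSON file while children nest under data.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Resource:
    """Hypothetical stand-in for a Book, Volume, or Page."""
    id: str
    binaries: List[str] = field(default_factory=list)
    children: List["Resource"] = field(default_factory=list)


def storage_paths(resource: Resource, prefix: str = "") -> List[str]:
    """Return every path needed to re-ingest a resource and its children."""
    base = f"{prefix}{resource.id}"
    paths = [f"{base}/{resource.id}.json"]
    # Assumption: binaries live next to the resource's own metadata file.
    paths += [f"{base}/{name}" for name in resource.binaries]
    for child in resource.children:
        # Children nest under the parent's data directory, so copying the
        # top-level directory copies the whole hierarchy.
        paths += storage_paths(child, prefix=f"{base}/data/")
    return paths


book = Resource("book1", children=[Resource("page1", binaries=["binary.tif"])])
for path in storage_paths(book):
    print(path)
```

Because every child path starts with its parent's path, copying the single top-level directory is enough to carry the whole hierarchy along.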
The following test cases are largely academic. We are in the process of building this functionality now, and will perform these actions as we go along. But they’ve helped us think through what it would take to implement each of the above options.
- Create a directory named <book-id>.
- Create bagit.txt in the base directory.
- Create bag-info.txt.
- Store each file in <book-id>/data, named <file-id>-<file-name>.<file-extension>, with the binary data. Store the metadata in <book-id>/metadata/<file-id>.json.
- Create manifest-md5.txt and tagmanifest-md5.txt with the checksums for each file in data and metadata.
- Update bag-info.txt to include the Payload-Oxum and Bag-Size.
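The manifest and Payload-Oxum steps can be sketched roughly as follows. This is an illustration under stated assumptions rather than our implementation: md5_of and write_manifest are hypothetical helpers, only the payload manifest for data is written, and the files under metadata would get the same treatment in tagmanifest-md5.txt.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path: Path) -> str:
    """MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag: Path) -> str:
    """Write manifest-md5.txt for everything under data; return Payload-Oxum."""
    payload = sorted(p for p in (bag / "data").rglob("*") if p.is_file())
    lines = [f"{md5_of(p)}  {p.relative_to(bag).as_posix()}" for p in payload]
    (bag / "manifest-md5.txt").write_text("\n".join(lines) + "\n")
    # Payload-Oxum is "<total bytes>.<file count>" over the payload files.
    total_bytes = sum(p.stat().st_size for p in payload)
    return f"{total_bytes}.{len(payload)}"


with tempfile.TemporaryDirectory() as tmp:
    bag = Path(tmp) / "book1"
    (bag / "data").mkdir(parents=True)
    (bag / "data" / "file1-page1.tif").write_bytes(b"fake image bytes")
    oxum = write_manifest(bag)
    manifest_text = (bag / "manifest-md5.txt").read_text()
    print(oxum)
    print(manifest_text, end="")
```

Note that every file in the Bag feeds into the same manifest, which is why any change to a child means rewriting files that belong to the parent.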
Pros:
Cons:
The Implementation Notes for OCFL provide options for things such as pair-trees for storing objects in the object root. I’m going to treat that as an implementation detail and pretend we’re using a file system that can handle an infinite number of resources in the same directory.
Also, directories in OCFL aren’t really considered to be “directories,” as the specification is meant to work on any platform. As such, files just happen to have slashes in their name. It’s expected and normal that if there’s an empty directory an OCFL client will ignore it or even get rid of it, which simplifies required cleanup.
- Create a directory named <book-id>.
- Create 0=ocfl_object_1.0 with the appropriate contents.
- Create a v1 directory.
- Store files in <book-id>/v1/content, separated by ID. So, for instance, <page1> would be in <book-id>/v1/content/<page1>, and have two files: <page1>.json and <page1>-<file-name>.<file-extension>.
- Create inventory.json in v1, as per the specification, with SHA512s of every file in content as the keys of the manifest property.
- Create inventory.json.sha512 in the v1 directory.
- Copy inventory.json and inventory.json.sha512 to the <book-id> directory.
Pros:
- … inventory.json if we’d like to, using up less space than the other options.
Cons:
- … inventory.json and incrementing the version counter.
A small note about GCS: I’m going to call paths “directories”, but similar to OCFL, it’s all just naming semantics. foo/bar/txt.yml isn’t a file in the directory bar, it’s a file that has a name which happens to have slashes in it. The GCS tools just visualize them as directories to be nice.
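To make the "names with slashes" point concrete, here is a small in-memory model of a bucket. It is only a sketch: it does not call the GCS API, and the bucket dictionary, list_names, and delete_prefix are hypothetical names made up for the example.

```python
from typing import Dict, List

# A bucket modeled as a flat mapping of object name to bytes. No directories.
bucket: Dict[str, bytes] = {
    "book1/book1.json": b"{}",
    "book1/data/page1/page1.json": b"{}",
    "book1/data/page1/file1-page1.tif": b"...",
    "book2/book2.json": b"{}",
}


def list_names(bucket: Dict[str, bytes], prefix: str) -> List[str]:
    """Everything "inside a directory" is just a name that starts with a prefix."""
    return sorted(name for name in bucket if name.startswith(prefix))


def delete_prefix(bucket: Dict[str, bytes], prefix: str) -> int:
    """Deleting a "directory" means deleting each object with that prefix."""
    doomed = list_names(bucket, prefix)
    for name in doomed:
        del bucket[name]
    return len(doomed)


print(list_names(bucket, "book1/data/"))
```

Once the last object with a given prefix is gone, nothing remains to clean up, because the "directory" never existed as a separate thing.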
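- Create a directory named <book-id>.
- Store the metadata in <book-id>/<book-id>.json.
- Store binary files in a data directory, like <book-id>/data/<file-id>-<file-name>.<file-extension>.
- Store each child in <book-id>/data/<page-id>.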
Pros:
Cons:
Either don’t delete anything or store Bags in GCS and delete the parent directory, relying on versioning.
See Deletion in the OCFL Implementation Notes
- Create a <book-id>/v2 directory.
- Create a v2/inventory.json file with a new version entry that has nothing in the state property.
- Create v2/inventory.json.sha512.
- Copy inventory.json and inventory.json.sha512 to the top directory.
There are unlikely to be locking problems, because nothing happens to the object post-delete.
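The shape of the inventories involved can be sketched as below. This is a simplified illustration, not a complete OCFL client: build_v1, delete_object, and sidecar are hypothetical helpers, the timestamps are placeholders, and optional inventory fields such as message and user are left out.

```python
import hashlib
import json
from typing import Dict, List


def sha512(data: bytes) -> str:
    return hashlib.sha512(data).hexdigest()


def build_v1(object_id: str, files: Dict[str, bytes], created: str) -> dict:
    """files maps logical paths (e.g. "page1/page1.json") to their bytes."""
    manifest: Dict[str, List[str]] = {}
    state: Dict[str, List[str]] = {}
    for logical_path, data in sorted(files.items()):
        digest = sha512(data)
        # The manifest is keyed by digest and points at stored content paths.
        manifest.setdefault(digest, []).append(f"v1/content/{logical_path}")
        state.setdefault(digest, []).append(logical_path)
    return {
        "id": object_id,
        "type": "https://ocfl.io/1.0/spec/#inventory",
        "digestAlgorithm": "sha512",
        "head": "v1",
        "manifest": manifest,
        "versions": {"v1": {"created": created, "state": state}},
    }


def delete_object(inventory: dict, created: str) -> dict:
    """Add a version whose state is empty; earlier content stays on disk."""
    next_version = f"v{len(inventory['versions']) + 1}"
    updated = json.loads(json.dumps(inventory))  # cheap deep copy
    updated["head"] = next_version
    updated["versions"][next_version] = {"created": created, "state": {}}
    return updated


def sidecar(inventory: dict) -> str:
    """Contents of inventory.json.sha512 for a serialized inventory."""
    body = json.dumps(inventory, indent=2).encode("utf-8")
    return f"{sha512(body)} inventory.json\n"


# Placeholder timestamps and sample content, for illustration only.
v1 = build_v1(
    "book1",
    {"book1.json": b'{"id": "book1"}', "page1/page1.json": b'{"id": "page1"}'},
    created="2019-01-01T00:00:00Z",
)
v2 = delete_object(v1, created="2019-01-02T00:00:00Z")
print(json.dumps(v2["versions"]["v2"]))
```

The manifest is untouched by the delete: the v1 content remains in place and only the new version's state says the object is now empty.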
Enable versioning and delete all files that are in the parent “directory” (a reminder there are no “directories”, so no structure is left around.)
Rely on GCS versioning.
- Update /<book-id>/metadata/<book-id>.json.
- Update manifest-md5.txt and tagmanifest-md5.txt.
All of the same locking problems from “Preserve a Book with N Pages” exist here: you’ll have to lock the entire object hierarchy. This can probably be implemented as an after_delete hook on the child object, but we will have to be careful that locks don’t run into one another and that the update of the parent object’s membership doesn’t cause unnecessary changes in the manifests.
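One way to picture "lock the entire object hierarchy" is to give every member of a hierarchy the same lock, keyed by its top-level object. The sketch below is an in-process illustration only: PARENTS, root_of, hierarchy_lock, and delete_page are hypothetical, and a real deployment with multiple workers would need a database or distributed lock instead of threading.Lock.

```python
import threading
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical parent lookup: child id -> parent id (None for top-level objects).
PARENTS: Dict[str, Optional[str]] = {
    "book1": None,
    "page1": "book1",
    "page2": "book1",
}

# One lock per top-level object, created on first use. The guard keeps that
# lazy creation itself thread-safe.
_locks = defaultdict(threading.Lock)
_locks_guard = threading.Lock()


def root_of(resource_id: str) -> str:
    """Walk up the hierarchy to the top-level object."""
    parent = PARENTS[resource_id]
    while parent is not None:
        resource_id, parent = parent, PARENTS[parent]
    return resource_id


def hierarchy_lock(resource_id: str):
    """Every member of a hierarchy shares its root's lock."""
    with _locks_guard:
        return _locks[root_of(resource_id)]


def delete_page(page_id: str, log: List[str]) -> None:
    with hierarchy_lock(page_id):
        # Inside the lock: remove the files, update the parent's membership,
        # then rewrite the manifests exactly once.
        log.append(f"deleted {page_id}, rewrote manifests for {root_of(page_id)}")


log: List[str] = []
delete_page("page1", log)
delete_page("page2", log)
print(log)
```

Two Pages of the same Book contend for the same lock, so concurrent deletes are serialized and the shared manifests are never rewritten by two writers at once.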
See Deletion in the OCFL Implementation Notes
- Create a v2/inventory.json file with a new version entry that doesn’t have the now-deleted Page’s metadata and binary files in it.
- Create v2/inventory.json.sha512.
- Copy inventory.json and inventory.json.sha512 to the top directory.
All of the same locking problems from “Preserve a Book with N Pages” exist here: you’ll have to lock the entire object hierarchy. This can probably be implemented as an after_delete hook on the child object, but we will have to be careful that locks don’t run into one another and that the update of the parent object’s membership doesn’t cause unnecessary changes in the manifests.
- Update <book-id>.json, which holds the membership array.
These operations can happen independently of one another, and so can just be an after_save and after_delete hook on those resources.
There are a couple options here. We can either create a whole new Bag, or rely on GCS’ versioning. I’m going to assume GCS versioning, because making a new Bag is pretty expensive.
- Update /<book-id>/metadata/<book-id>.json.
- Update manifest-md5.txt and tagmanifest-md5.txt.
All of the same locking problems from “Preserve a Book with N Pages” exist here - you’ll have to lock the entire object hierarchy.
See Addition and Updating in the OCFL Implementation Notes
- Create a <book-id>/v2 directory.
- Create the v2 inventory.json.
- Add the new metadata and binary files to the v2/content directory.
- Add the new metadata and binary files to the manifest key of inventory.json.
- Add the new metadata and binary files to the state key of inventory.json.
All of the same locking problems from “Preserve a Book with N Pages” exist here - you’ll have to lock the entire object hierarchy.
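Adding a Page to an existing OCFL object can be sketched as a function from one inventory to the next. This is a simplified illustration under assumptions: add_version is a hypothetical helper, the timestamps and sample content are placeholders, and writing the actual files to v2/content is left out.

```python
import copy
import hashlib
from typing import Dict


def add_version(inventory: dict, new_files: Dict[str, bytes], created: str) -> dict:
    """Return a new inventory with one more version that adds new_files."""
    updated = copy.deepcopy(inventory)
    previous = updated["head"]
    version = f"v{int(previous[1:]) + 1}"
    # The new state starts as a copy of the previous version's state.
    state = copy.deepcopy(updated["versions"][previous]["state"])
    for logical_path, data in sorted(new_files.items()):
        # A changed file (e.g. the parent's JSON) replaces its old state entry.
        for old_digest in list(state):
            state[old_digest] = [p for p in state[old_digest] if p != logical_path]
            if not state[old_digest]:
                del state[old_digest]
        digest = hashlib.sha512(data).hexdigest()
        if digest not in updated["manifest"]:
            # Only bytes new to the object are stored under the new version.
            updated["manifest"][digest] = [f"{version}/content/{logical_path}"]
        state.setdefault(digest, []).append(logical_path)
    updated["head"] = version
    updated["versions"][version] = {"created": created, "state": state}
    return updated


# Placeholder content and timestamps, for illustration only.
OLD_BOOK = b'{"members": []}'
old_digest = hashlib.sha512(OLD_BOOK).hexdigest()
v1 = {
    "id": "book1",
    "type": "https://ocfl.io/1.0/spec/#inventory",
    "digestAlgorithm": "sha512",
    "head": "v1",
    "manifest": {old_digest: ["v1/content/book1.json"]},
    "versions": {
        "v1": {"created": "2019-01-01T00:00:00Z", "state": {old_digest: ["book1.json"]}}
    },
}
v2 = add_version(
    v1,
    {"book1.json": b'{"members": ["page1"]}', "page1/page1.json": b'{"id": "page1"}'},
    created="2019-01-02T00:00:00Z",
)
print(sorted(path for paths in v2["versions"]["v2"]["state"].values() for path in paths))
```

Adding a single Page still produces a new inventory for the whole Book, which is where the need to lock the entire hierarchy comes from.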
- Update <book-id>.json when the parent’s membership is updated.
- Store the new metadata and binary files with the appropriate names.
If any locking is necessary, it will only have to cover those files with new content, and so can be implemented as part of a save operation on a per-resource basis.
The summary for this is: just treat child Books as if they’re “Pages”, and allow everything to go arbitrarily deep. There’s no real difference between this and previous cases, except that Google Cloud Storage can handle each “Volume” without ever touching its parent. Deletions and additions of “Pages” never have to go up more than one level in the hierarchy.
We chose the Google Cloud Storage method, as seen in our Architecture Decision Record.
The cases above show that, because of our choice to have a hierarchy of children, it was by far the easiest to implement, the least complex, the least likely to break, and didn’t require any special locking mechanisms. The implementation can be seen here.
Based on the above, I personally recommend that if you choose either OCFL or BagIt, you either store everything flat (every “Page”/“Volume”/“Book” in the same object root) or have a very good and well-tested pessimistic locking implementation.
However, our system is very new, so we’ll see how things evolve!