File Fixity and Data Integrity

You have audited your digital content, established an appropriate metadata schema, and vetted your new taxonomy. The DAM system is installed, customized, and tested. All you have to do is catalog assets. Right? Well, if digital preservation is high on your list of priorities, and it should be, you will also want to ensure assets remain “what they purport to be”. Your efforts will ensure asset authenticity and help your users trust the data you manage on their behalf.

One recent study found that many user groups are not conducting data reviews on ingest or at points throughout the lifecycle of their digital assets.

… as a result, [they] were unable to identify early signs of media failure or determine if and when a data change or loss event has occurred (Knight, 2012, p. 245).

To manage and protect valuable digital files, shouldn’t we expect DAM systems to have built-in mechanisms for verifying file integrity? Regrettably, most DAM systems lack this important ability. In fact, some vendors were surprised by the idea, and skeptical of the need for it, when I reached out to them for comment. Really? That is strange, since DAM systems purport to be digital repositories of the very kind described in the OAIS conceptual model. According to OAIS documentation, Information Packages must be validated against a digital signature created prior to ingest and at points along an asset’s lifecycle. Furthermore, offering data integrity checks out of the box seems to be a low-effort proposition: many free tools, such as Carbon Copy Cloner, have offered them for years. So I call on DAM system vendors to add this functional requirement to the next minor or major release of their products.

Authenticity can be established by judging an object’s identity, made up of such attributes as provenance, creation date, relationships to other assets, etc. This is all well and good for physical items, but in the digital realm, data can easily be changed or corrupted without affecting descriptive metadata, or vice versa. For example, it is difficult to tell from names and descriptive metadata alone whether “photo.psd” is the same file as “photo-copy.psd”. To solve this riddle, we must pay attention to a bitstream’s integrity, or completeness, over time. Digital preservationists have a yardstick for this called fixity.

Fixity

Fixity is the static, invariable, and changeless state of a digital asset. Fixity checks play several roles: to validate that the digital content preserved is what was intended to be preserved, to ensure that a data transfer is indeed an exact “bit-for-bit copy” of its source, to actively monitor the integrity of digital objects, to identify early signs of data rot, and to determine when a data change or loss event has occurred. Assessing fixity is an important responsibility we must bear to encourage users to trust the assets we manage. Fixity is contingent on hard evidence that comes in the form of a checksum.

Checksums

The three most popular types of algorithms that produce checksums are MD5, SHA-1, and SHA-2. “Applying these algorithms to a file produces an (almost certainly) unique hash or checksum value and will consistently produce this value if a file is unchanged” (Austin, 2011, p. 5). Comparing these checksums, sometimes called hashcodes, can help asset managers deduplicate assets (Riecks, 2010). Some forward-thinking DAM systems offer this function. None yet offer the ability to verify file integrity, in spite of calls for action. In principle these values must be stored separately from an asset: embedding the hashcode, or any other metadata, will change the file and invalidate the checksum. More on this later.
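
As a rough illustration of both uses, here is a minimal sketch in Python. It assumes only the standard hashlib module; the helper names file_checksum and find_duplicates are hypothetical, and SHA-256 (a member of the SHA-2 family) is simply the default chosen for this example.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"" (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(paths, algorithm="sha256"):
    """Group paths by checksum and return only the groups with more than one file."""
    groups = defaultdict(list)
    for path in paths:
        groups[file_checksum(path, algorithm)].append(Path(path))
    return {value: files for value, files in groups.items() if len(files) > 1}
```

Because the checksum depends only on the bytes in the file, two files with different names but identical content land in the same group, while a file that differs by even one byte does not.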

Sample Process

  1. before starting, block write-access to the data object
  2. capture the fixity value as early as possible – ideally at the time of initial creation
  3. run a checksum on an object upon ingest and compare the value to that generated prior to ingest
  4. check the fixity of assets at regular intervals
  5. compare checksums of the source with that of the distribution copy
  6. maintain a history of changes to the checksum
  7. check fixity in response to specific events or activities (e.g., file migration or media refresh)
  8. repair or replace corrupted data
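
Parts of this process can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (sha256_of, record_fixity, check_fixity) and the JSON-lines audit log are hypothetical choices for this example, not features of any DAM system, and blocking write access and repairing data are left out.

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(path, log_path, event):
    """Append a checksum entry for `path` to a JSON-lines audit log (steps 2 and 6)."""
    entry = {
        "path": str(path),
        "sha256": sha256_of(path),
        "event": event,
        "checked_at": datetime.now(timezone.utc).isoformat(),
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
    return entry


def check_fixity(path, log_path, event="scheduled check"):
    """Compare the current checksum with the earliest recorded one (steps 3, 4 and 7)."""
    with open(log_path, encoding="utf-8") as log:
        history = [json.loads(line) for line in log if line.strip()]
    baseline = next((e for e in history if e["path"] == str(path)), None)
    if baseline is None:
        raise ValueError(f"no fixity value recorded for {path}")
    # Every check is logged, so the log doubles as a history of the checksum.
    current = record_fixity(path, log_path, event)
    return current["sha256"] == baseline["sha256"]
```

A typical use would be to call record_fixity(asset, log, "ingest") once, then call check_fixity(asset, log) on a schedule or after a migration. A False result signals that the bitstream no longer matches its baseline and that step 8 applies.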

Concluding Remarks

Metadata should travel with assets, and embedding this information has worked very well so far. However, adding, editing, and replacing embedded metadata affects the fixity of a whole file, defeating the purpose of fixity checking. Some have suggested generating intra-file fixity values, which can then be embedded as metadata into a file. This is where checksums are run on parts of the file instead of the whole: checks would be performed on each audio track, every video frame, or each layer of a TIFF file. These values would remain unaffected when metadata is written back into the whole file. Alternatively, adopting a wrapper-based media file format, such as MXF, would also be feasible. This type of file contains the file essence(s) (video, audio, image) and a metadata record. This last option is only possible if your system supports reading from and writing to these types of information packages. Barring these choices, there is a case to be made for storing metadata separately from your digital assets, as sidecar files. The essence (original bitstream) remains unchanged and auditable while the separate metadata record may be continuously updated.
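
The sidecar option is the easiest of the three to illustrate. The sketch below is a hypothetical example, not a description of any particular DAM system: the write_sidecar helper, the ".json" naming rule, and the record layout are all assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file (read whole; fine for a small example)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_sidecar(asset_path, metadata):
    """Write metadata to '<asset file name>.json' beside the asset and return that path."""
    asset_path = Path(asset_path)
    sidecar = asset_path.with_name(asset_path.name + ".json")
    record = {
        "asset": asset_path.name,
        # The checksum lives outside the asset, so storing it cannot invalidate it.
        "sha256": sha256_of(asset_path),
        "metadata": metadata,
    }
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

However often write_sidecar is called with new descriptive metadata, the asset itself is never opened for writing, so its checksum stays the same from one audit to the next.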

Do you perform fixity checks on your data? If so, how? What are some of the challenges you face?

References

Austin, T., & Richards, J. (2011). Archaeology Data Service: Preservation Policy (No. 1.3.1) (pp. 1–22). Archaeology Data Service. Retrieved from http://archaeologydataservice.ac.uk/attach/preservation/PreservationPolicyV1–1.pdf

Ball, A. (2006). Briefing Paper: the OAIS Reference Model. University of Bath: UKOLN. Retrieved from http://www.ukoln.ac.uk/projects/grand-challenge/papers/oaisBriefing.pdf

Knight, G. (2012). A Digital Curate’s Egg: A Risk Management Approach to Enhancing Data Management Practices. Journal of Web Librarianship, 6(4), 228–250.

Murray, K. (2014, March 4). It’s Not Just Integrity: Fixity Data in Digital Sound and Moving Image Files | The Signal: Digital Preservation [webpage]. Retrieved July 4, 2015, from http://blogs.loc.gov/digitalpreservation/2014/03/its-not-just-integrity-fixity-data-in-digital-sound-and-moving-image-files/

Owens, T. (2014, February 7). Check Yourself: How and When to Check Fixity | The Signal: Digital Preservation [webpage]. Retrieved July 4, 2015, from http://blogs.loc.gov/digitalpreservation/2014/02/check-yourself-how-and-when-to-check-fixity/
