Archive File Normalisation Policy

What is normalisation?

Normalisation is the migration (or transcoding) of a digital file from its original format to a new file format.

What role does normalisation play in our use of Archivematica?

Archivematica allows users to normalise files for both preservation and access purposes. In other words, we can create two versions of our original data files using two normalisation processes: one for long-term preservation, stored in the back end of our archive, and one for access, to be viewed by public visitors to our online archive.

Background and Rationale:

In January 2020 I attended a meeting of the Archivematica UK User Group at the University of Westminster in London. Rachel MacGregor, the convenor of the group, raised the question of Archivematica’s default normalisation policies and what to do about them. Many users getting to grips with Archivematica are understandably tempted to stick with its default settings, trusting that the defaults are in place for good reason, without interrogating whether these meet their specific preservation requirements. However, someone made the interesting point that Archivematica's default normalisation policies are based on an assessment of who the average user of Archivematica is, i.e. an archivist working in a special collections library. Those who don’t fit this mould (like ourselves!) should consider whether the default normalisation settings are appropriate and, if not, consider changing them. Indeed, Archivematica’s normalisation policies exist to be edited; customisation is encouraged amongst the community of users.

Some users question whether certain files, particularly ubiquitous formats such as JPEGs, should be normalised at all. Normalising often means the conversion of compressed files (such as JPEGs or MP3s) into much larger uncompressed formats (such as TIFF or WAV). Compressed video files, in particular, can increase in size dramatically when normalised into losslessly compressed FFV1-encoded Matroska (MKV) video files, as happens under Archivematica’s default normalisation policies (see Blewer for a discussion of why). This type of normalisation increases the size of the AIP, and therefore the storage space needed, without increasing the quality or the amount of information contained within the file (Mitcham 2017). This also has environmental implications: larger files take up more server space, resulting in a larger carbon footprint (Walsh 2019).

Concordia University Library, for example, have made the decision to disable normalisation to FFV1 in MKV for H.264-encoded MP4s (in other words, normalising a lossy compressed video format into a much larger lossless one) and normalisation of PNG images into TIFFs. They are also considering disabling the normalisation of JPEGs into TIFFs (Walsh 2019). Other institutions make a value judgement about what material is more important to preserve, and normalise the files considered to be of greater value into larger uncompressed formats (Richan 2019).

However, as Walsh acknowledges, keeping files in compressed formats is less of a risk for institutions that have a well-resourced infrastructure for digital preservation. For example, Concordia mitigate the instability of compressed file formats by making multiple copies of the files, which they keep on different media and in different locations, and they have systems in place to perform regular checks on the ongoing integrity of the data files. On our project we do not have these resources or an existing data preservation infrastructure; we are essentially creating this from scratch. Therefore we would face greater risks by storing files in compressed formats for the purposes of preservation.
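The regular integrity checks described above are commonly done by recording a checksum for each file and recomputing it later. The following is a minimal sketch of that idea, not a description of Concordia's or Archivematica's actual systems; the function names, the manifest layout and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(manifest: dict[str, str], root: Path) -> list[str]:
    """List files that are missing or no longer match their recorded checksum.

    `manifest` is a hypothetical mapping of relative file name -> expected digest.
    """
    failures = []
    for relative_name, expected in manifest.items():
        candidate = root / relative_name
        if not candidate.is_file() or sha256_of(candidate) != expected:
            failures.append(relative_name)
    return failures
```

Run on a schedule against each stored copy, a check like this flags files that have silently changed, which is only useful if a second, intact copy exists elsewhere to restore from.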

As this suggests, there remain definite advantages to normalising into uncompressed files. Often, these are the file formats which digital preservation workers currently consider to have the greatest longevity and to be more resilient against corruption, often describing them as “preservation friendly” formats. Normalising can also reduce the range of different file formats in your archive, meaning that you have fewer file types to manage when conducting ongoing preservation management tasks. Furthermore, normalising files allows you to spot potential issues with the files at the ingest stage. A file may fail to normalise properly, alerting you to a problem that you might not otherwise have noticed but can now fix, rather than archiving a potentially unstable file which you may not be able to render later on (McLellan 2019).

Archivematica normalises files into preservation friendly formats by using a range of different open source tools, which are embedded in its software environment. However, some file types are not recognised by Archivematica or have no corresponding normalisation tools, and hence there is no established way of automatically normalising them. For example, we are using the relatively new and niche Web ARChive file format (WARC), for which a normalisation tool or policy does not exist. Archivematica will still preserve the file in the AIP, but in such cases we have to accept that the archived file will remain in the file format in which we created it. This is not necessarily a problem: in the case of .warc files there is open source software available to view these files, which should provide some promise of longevity.

In February 2020 Artefactual will launch a new version of Archivematica. Having solicited feedback from users, Artefactual have decided to remove default preservation normalisation for video files entirely. This is because of feedback that normalisation of video files resulted in huge output files, sometimes 10 to 20 times the size of the originals, and that many of the formats currently set to be transcoded are widely supported for playback (McLellan 2019b). They will also remove defaults for transcoding JPEG (except JPEG 2000), PNG and GIF files to TIFFs, again because these formats are easily rendered and ubiquitous (University of Glasgow 2019) and, alongside TIFFs, are listed as preferred preservation formats by the Library of Congress. I will need to customise the settings again when the new version is launched, to ensure that the settings we are using are consistent before and after the update.
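The reported growth of 10 to 20 times gives a quick way to estimate what video normalisation could mean for storage. The sketch below is a rough illustration only: the function name is hypothetical, and it assumes the AIP keeps both the original file and the normalised preservation copy.

```python
def aip_video_storage_range(
    original_gb: float,
    low_factor: float = 10,
    high_factor: float = 20,
) -> tuple[float, float]:
    """Estimate (low, high) AIP storage in GB for normalised video.

    Assumes the AIP holds the original plus a preservation copy that is
    `low_factor` to `high_factor` times the size of the original.
    """
    if original_gb < 0:
        raise ValueError("original_gb must not be negative")
    return (original_gb * (1 + low_factor), original_gb * (1 + high_factor))


# A hypothetical 50 GB of compressed video footage:
print(aip_video_storage_range(50))  # (550, 1050)
```

On these assumptions, 50 GB of compressed footage would need somewhere between 550 GB and 1,050 GB once normalised, which is the scale of increase behind the community feedback.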

What type of data files have we collected? What are their file formats?

ASCII text / ASCII English text [character encoding for Excel-generated .csv files]

ISO Media [MPEG-4 Video (.mp4); .mov]

Digital Negative Format (DNG)

JPEG image data [.jpg and .jpeg]

PNG image data [.png]

TIFF image data [.tif]

MP3 files [.mp3]

Advanced Video Coding High Definition (AVCHD) video format [file extension .mts with codec mpeg]

gzip compressed data [.gz] [warc cache files]

Web Archive File [.warc]

Microsoft Word 2007+ [.doc]

Webm [.webm] audio-visual

PDF document [.pdf]

Audio file with ID3 version 2.3.0 / MPEG ADTS / Audio file with ID3 version 2.4.0 [.mp3]. ID3 is a metadata container commonly used by .mp3 files.

UTF-8 Unicode English text [character encoding for .csv files]

How is Archivematica set to normalise these files under its default settings?

| File type on sftp | Normalisation for preservation | Normalisation for access | Normalisation for thumbnails |
|---|---|---|---|
| TIFF | TIFF | JPEG | JPEG |
| JPEG | TIFF | JPEG | JPEG |
| PNG | TIFF | JPEG | JPEG |
| DNG | TIFF | JPEG | JPEG |
| MPEG-4 Video (.mp4) | .mkv with ffv1 | MPEG-4 (.mp4) | MPEG-4 (.mp4) |
| M2TS (.mts) | No normalisation rules (customise) | No normalisation rules (customise) | No normalisation rules (customise) |
| Web Archive File Format (.warc) | No normalisation rules | No normalisation rules | No normalisation rules |
| MPEG 1/2 Audio Layer 3 | .WAV | .mp3 | n/a |
| Webm | No normalisation rules | No normalisation rules | No normalisation rules |
| Acrobat PDF | PDF/A | No normalisation rules | n/a |
| Comma separated values | No normalisation rules | No normalisation rules | No normalisation rules |
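The default preservation rules listed above can also be written down as a simple lookup, which makes it easy to see which of our formats would pass into the AIP unchanged. This is an illustrative sketch only: the dictionary and function names are hypothetical, `None` stands in for "no normalisation rules", and nothing here reads from Archivematica itself.

```python
from typing import Optional

# Default preservation target per format, transcribed from the list above.
# None means there is no preservation normalisation rule for that format.
DEFAULT_PRESERVATION_TARGET: dict[str, Optional[str]] = {
    "TIFF": "TIFF",
    "JPEG": "TIFF",
    "PNG": "TIFF",
    "DNG": "TIFF",
    "MPEG-4 Video (.mp4)": ".mkv with ffv1",
    "M2TS (.mts)": None,
    "Web Archive File Format (.warc)": None,
    "MPEG 1/2 Audio Layer 3": ".WAV",
    "Webm": None,
    "Acrobat PDF": "PDF/A",
    "Comma separated values": None,
}


def formats_kept_as_is(rules: dict[str, Optional[str]]) -> list[str]:
    """Return the formats that have no preservation rule, in listed order."""
    return [name for name, target in rules.items() if target is None]


print(formats_kept_as_is(DEFAULT_PRESERVATION_TARGET))
```

For our collection this picks out the .mts, .warc, Webm and .csv files as the ones that need a deliberate decision rather than a default.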

 

Should we stick with the default settings?

Overall, I believe we should stick with the default settings for normalising for preservation, albeit with some exceptions. This is because we do not have the capacity or the infrastructure to make multiple copies of our data files (mostly compressed files) to be kept on different media and in different locations, or to perform regular checks on the ongoing integrity of the compressed data files. Our AIPs will be stored on one server, and hence our best insurance against the corruption of the data files in our AIPs is to normalise them into uncompressed file formats, which are more resilient. 

Many community users have decided not to normalise video files as FFV1-encoded Matroska (MKV) video files due to the large size of MKV files, and the burden this places on budgets and server space. Unsustainable budgetary pressures for storage space are also a threat to long term preservation. However, we have decided to stick with MKV files as default.

It is worth quoting Ashley Blewer (2019), an a/v preservation specialist, on the advantages of MKV files, which would justify their large size.

“I think the best possible choice a preservationist can make for the storage of their video assets is FFV1-encoded Matroska video files with LPCM audio because:

  • the formats are open and can always be understood/deciphered/transcoded/accessed far into the future;
  • it uses a losslessly compressed algorithm that saves on storage but can be reverted back to an uncompressed data stream;
  • there are frame-level and section-level checksums so errors can be narrowed down to specific frames or parts of the file instead of the entire thing; and
  • of the ability to embed and attach robust metadata into the file itself.”

The one exception is the way we handle .mts files. During fieldwork, we occasionally forgot to set the Panasonic camcorder to record in .mp4, and footage was recorded as .mts files, a default format developed by Sony and Panasonic, which is listed on PRONOM under the M2TS file format title. There are no established normalisation rules associated with this format in Archivematica, meaning the file would be included in the AIP unaltered. As a proprietary format, this might not be the most preservation friendly format. Also, you cannot play back .mts video files within AtoM’s interface, so if we decided to make these files available on AtoM people would have to download them to view in VLC media player. I posted to the Archivematica Google group to ask for advice, and Ashley Blewer let me know how to create a normalisation command and rule under the preservation planning tab on the Archivematica dashboard to convert .mts files to FFV1-encoded MKV files, which I will do and use as the default normalisation setting.
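To give a sense of what such a conversion involves, the sketch below builds an ffmpeg command line that transcodes an .mts file to FFV1 video with LPCM audio in a Matroska container. It is not the command supplied on the Google group, nor Archivematica's own command template; the function name and the particular choice of ffmpeg options are assumptions for illustration.

```python
from pathlib import Path


def ffv1_mkv_command(source: Path, output_dir: Path) -> list[str]:
    """Build an ffmpeg argument list for an FFV1/LPCM Matroska transcode.

    The options are an assumed, commonly used set, not a confirmed
    Archivematica normalisation command.
    """
    target = output_dir / (source.stem + ".mkv")
    return [
        "ffmpeg",
        "-i", str(source),     # input file, e.g. footage.mts
        "-c:v", "ffv1",        # lossless FFV1 video codec
        "-level", "3",         # FFV1 version 3
        "-g", "1",             # every frame is a keyframe
        "-slicecrc", "1",      # embed per-slice checksums
        "-c:a", "pcm_s16le",   # uncompressed 16-bit LPCM audio
        str(target),
    ]


command = ffv1_mkv_command(Path("footage.mts"), Path("normalised"))
print(" ".join(command))
```

On a machine with ffmpeg installed, the returned list could be passed to `subprocess.run(command, check=True)` to test the transcode on a sample file before setting up any rule in the dashboard.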

The normalisation for access defaults are also suitable for our purposes, as they normalise uncompressed formats into smaller compressed files, which will be quicker to load for users in Nigeria who may be accessing the AtoM site using smartphones on lower-bandwidth connections. Much of the discussion and contention outlined in the background and rationale section above concerns Archivematica’s normalisation for preservation defaults; its normalisation for access defaults appear relatively uncontroversial.

 

Further reading

 

https://www.dpconline.org/blog/idpd/finding-the-balance