Document Preservation and Retrieval—Saving Old Information With New Technology

By Mark Ciotola

First published on August 24, 2019. Last updated on June 6, 2024.

Objectives

Learn about using digital technologies to preserve documents, photos and other 2D materials.
Learn about the various file types, and how to store, organize and protect digital archives.

Traditional Means of Document Preservation

In the old days, the work of historians was much like that of the Indiana Jones character: seeking ancient manuscripts and other works, sometimes traveling across the globe. Some historians still must do so, and others prefer to do so, but now there is a tremendous amount of historical information available through the internet, if you know how to locate it. However, documents don’t put themselves on the internet. Saving old physical works is another major component of digital history. This lesson concerns document preservation and retrieval, literally saving old information with new technology.

There were several ancient means of recording historical events. The earliest may have been storytelling and songs, which helped people learn about and remember past experiences and events in their society. Cave paintings may have been another early means. When language began to be written (recorded in an external physical form), clay tablets, stone engraving and scrolls of papyrus were used. Eventually other forms of paper were used as well, and sheets were combined into books. While inventions such as the printing press resulted in the proliferation of written works, such technologies were mere improvements of the same approach.

Phonographs, film and magnetic materials in the nineteenth and twentieth centuries finally made breakthroughs in recording events and other information. Recordings of audio and visual events could be made so that future persons could experience direct sensory perceptions of those events rather than reading about them. Further, historical documents could be stored in film version and retrieved and photocopied at will. Yet these technologies were not digital.

Saving Old Information With New Technology

Digital technology allows for the preservation, storage and retrieval of historical documents. Other technologies have allowed this for thousands of years, so what is special about digital technology?

The term digital refers to recording and processing information as strings of numbers, ultimately as binary digits of 0 and 1. This allows that information to be processed by computers. Computers are fast, and the information within them can be transmitted and transformed with relative ease. That means that historical recordings can be reproduced almost instantly. Documents can be retrieved quickly, and searches for names, places and terms can be performed easily across millions of documents. Of course the term easy is a relative one, as compared to such searches without digital technology. Searching can still require thinking and skill, but the ratio of brain work to mere mechanical, manual activities has increased significantly.
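To make the idea of text stored as 0s and 1s concrete, here is a minimal sketch in Python. The function names `to_bits` and `from_bits` are made up for this illustration, and the sketch assumes the common UTF-8 text encoding.

```python
def to_bits(text: str, encoding: str = "utf-8") -> str:
    """Return the bytes of `text` as space-separated groups of 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


def from_bits(bits: str, encoding: str = "utf-8") -> str:
    """Reverse to_bits: turn space-separated 8-bit groups back into text."""
    return bytes(int(group, 2) for group in bits.split()).decode(encoding)


word = "Rome"
bits = to_bits(word)
print(bits)             # 01010010 01101111 01101101 01100101
print(from_bits(bits))  # Rome
```

Each letter becomes one group of eight binary digits, and the original word can be recovered exactly. This is why a digital copy can be reproduced without the gradual loss that comes with copying paper or film.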

Digital records often begin life in image form. Often they are then processed using optical character recognition (OCR) and either include a text “layer” or are converted into a text document.

Digital records and archives

Let’s first discuss information that is already present on the internet or in other digital forms. There are several important aspects of digital information:

What is it?
What form is it in?
Where is it located?
How can it be accessed?

Is there one answer to rule them all? No! As a historian, you may be confronted with a tremendous variety of answers to these questions. Some sources will be on floppy disks, CDs or even magnetic tapes. Some will be behind paywalls. Some will be in file formats which modern computers cannot read. You can savor the exotic possibilities later. For now, the most common cases will be covered.

Images of Primary Source Documents

A three-thousand-year-old clay tablet can be converted into digital form by simply photographing it. The image will be in a file. If you have technology that can access, read and display the image, then you can see much of the information contained in that ancient tablet. It might be even better if the contents of the tablet are searchable, in the form of a text file. (Sometimes they will be, sometimes they won’t.)
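Once the contents of documents exist as text, searching them is straightforward. The following is a minimal sketch of a word index in Python. The file names and transcriptions are hypothetical, invented for this illustration, and real archive search tools are considerably more sophisticated.

```python
import re
from collections import defaultdict


def build_index(documents: dict) -> dict:
    """Map each lowercase word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(name)
    return index


def search(index: dict, term: str) -> list:
    """Return the sorted names of documents containing `term` (case-insensitive)."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical transcriptions, for illustration only.
documents = {
    "letter_1848.txt": "Arrived in Sacramento after the long voyage.",
    "diary_1849.txt": "Left Sacramento for the gold fields near Coloma.",
    "ledger_1850.txt": "Flour, nails and rope sold in Coloma.",
}

index = build_index(documents)
print(search(index, "Sacramento"))  # ['diary_1849.txt', 'letter_1848.txt']
print(search(index, "Coloma"))      # ['diary_1849.txt', 'ledger_1850.txt']
```

The index is built once, after which each lookup is nearly instant no matter how many documents were indexed. An image-only scan cannot be searched this way until its text has been extracted.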

Examples of images include scans of documents and photographs. Sometimes there might be other representations, such as vector graphics.

Text-Based Primary Source Documents

Primary sources in digital form can be original digital documents (such as notes from a meeting typed on a word processor) or in indirect digital form such as a typed up copy of a newspaper article. Text-based documents are ultimately in files.

Secondary Sources

Older secondary sources may be processed in the same manner as old primary documents. Newer secondary sources are probably already natively in a digital form and can be found by an ordinary web or database search. Unfortunately, many of these sources are behind a paywall. If you don’t want to pay, go through a library. If your university supports OneSearch, this is the easiest way to start looking. Otherwise, your library may have research guides concerning which materials it has available and how to access them. (Each library is somewhat different.)

File Types

Text Files

There are several common forms of text files.

Pure text files end with “.txt”. They only contain text and neither other types of content nor formatting information. This form is generally easy to read by humans, and suitable for computer program code. Sometimes these are called plain text files.
Rich Text Format files contain some formatting information and end with “.rtf”. These are generally not suitable for computer programs.
Some older word processors read and generate files that end with “.doc”. These can contain formatting information, images and other types of content.
Some newer word processors generate files that end with “.docx”. They are similar to .doc files in terms of content and information, but have a significantly different file format.
Comma separated value files have values separated by commas. Strictly speaking, these are text files, but in a form that is readable by databases, spreadsheets and other specialized software. Collections of public and historical records are often exported in this format. They often end with “.csv”.
Structured Information files are similar to database records, and may be considerably more complex than a simple .csv file. They may end with “.xml”.
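The difference between a flat .csv file and a nested .xml file is easier to see in practice. The sketch below reads both with Python's standard library. The records, field names and XML layout are hypothetical, chosen only for illustration, since real collections define their own schemas.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records, for illustration only.
CSV_TEXT = """name,year,place
Jane Doe,1850,Boston
John Roe,1851,Chicago
"""

XML_TEXT = """<records>
  <record id="1">
    <name>Jane Doe</name>
    <event year="1850"><place>Boston</place></event>
  </record>
</records>"""


def read_csv_records(text: str) -> list:
    """Parse CSV text into one dictionary per row, keyed by the header line."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def read_xml_records(text: str) -> list:
    """Flatten each <record> element of the assumed layout into a dictionary."""
    root = ET.fromstring(text)
    return [
        {
            "name": record.findtext("name"),
            "year": record.find("event").get("year"),
            "place": record.findtext("event/place"),
        }
        for record in root.iter("record")
    ]


print(read_csv_records(CSV_TEXT)[0])
print(read_xml_records(XML_TEXT)[0])
```

Both calls yield the same dictionary for Jane Doe. The CSV form is limited to one row of values per record, while the XML form can nest elements and attributes to any depth, which is why it suits more complex records.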

Archival Files

Archival files may contain primarily text, but they often contain additional elements such as formatting. The Portable Document Format (PDF) can preserve formatting, but it has several possible deficiencies that make retrieving the document in its original form problematic. However, PDF/A is a preferred archival file format. It attempts to maintain device independence, self-containment and self-documentation. For example, it requires embedded fonts rather than linked fonts, in case the linked source is no longer available. The PDF/A format prohibits the inclusion of audio, video, JavaScript and executable content. So this format may be suitable for traditional print media but not for your favorite video game.

There are several versions of PDF/A files, such as PDF/A-1, PDF/A-2, and PDF/A-3, with the higher numbers allowing for embedding more advanced content such as richer graphics. (If interactivity is required, the PDF/E format might be considered, albeit at the loss of some portability.)
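A PDF/A file declares its version in its embedded XMP metadata, using the properties `pdfaid:part` and `pdfaid:conformance`. The following rough sketch in Python looks for that declaration. It assumes the metadata is stored as uncompressed, single-byte text, and it only reports what a file claims about itself. It does not check that the file actually meets the standard, which requires a dedicated validation tool such as JHOVE. The function name `guess_pdfa_level` is made up for this illustration.

```python
import re
from typing import Optional

# The declaration may appear as an attribute (pdfaid:part="1")
# or as an element (<pdfaid:part>1</pdfaid:part>).
_PDFA_PART = re.compile(rb"""pdfaid:part\s*(?:=\s*["']|>)\s*(\d)""")
_PDFA_CONF = re.compile(rb"""pdfaid:conformance\s*(?:=\s*["']|>)\s*([A-Za-z])""")


def guess_pdfa_level(data: bytes) -> Optional[str]:
    """Return the PDF/A level a file claims (e.g. 'PDF/A-1b'), or None.

    Heuristic only: reads the claim in the metadata and does not validate it.
    """
    if not data.startswith(b"%PDF-"):
        return None  # not a PDF file at all
    part = _PDFA_PART.search(data)
    if part is None:
        return None  # an ordinary PDF with no PDF/A declaration found
    level = "PDF/A-" + part.group(1).decode("ascii")
    conformance = _PDFA_CONF.search(data)
    if conformance is not None:
        level += conformance.group(1).decode("ascii").lower()
    return level


# Usage with a real file (the path is hypothetical):
# with open("scan.pdf", "rb") as handle:
#     print(guess_pdfa_level(handle.read()))
```

A result of None means only that no declaration was found under the stated assumptions. It is not proof that the file is unsuitable for archiving.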

Image Files

There are several common types of image files:

.gif—good for illustrations
.jpg—good for photographs, compressed format to save disk space and load faster
.png—good for illustrations and photographs but may not be supported by all platforms.
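File extensions can be wrong or missing, especially in old collections. Each of these image formats begins with a fixed sequence of bytes, sometimes called a signature or magic number, which software can check instead. A minimal sketch in Python, with the function names `sniff_image_type` and `sniff_file` made up for this illustration:

```python
from typing import Optional

# Leading bytes that identify each format.
SIGNATURES = {
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"\xff\xd8\xff": "jpg",
    b"\x89PNG\r\n\x1a\n": "png",
}


def sniff_image_type(header: bytes) -> Optional[str]:
    """Guess the image type from the first bytes of a file, or return None."""
    for magic, kind in SIGNATURES.items():
        if header.startswith(magic):
            return kind
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the first few bytes of the file at `path` and guess its type."""
    with open(path, "rb") as handle:
        return sniff_image_type(handle.read(16))


print(sniff_image_type(b"GIF89a" + b"\x00" * 10))  # gif
print(sniff_image_type(b"just some text"))         # None
```

Checking the signature confirms only the general file type. It does not show that the rest of the file is intact or readable.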

Audio-Video Files

There are several common types of audio and video files:

.mp3—a common audio file. Compressed.
.mp4—most common video file. Compressed.
.webm—a popular alternative video format
.mov—a video format used by some Apple applications
.flv—a Flash video format. Adobe ended support for Flash Player at the end of 2020, and Flash has had security issues

Digital Preservation Technology

Although there are many physical ways to help preserve physical documents and artifacts, this discussion will focus on digital technologies. It is possible to merely collect information about an object, such as radar scanning of a large archeological site, or to actually reproduce essential qualities of an object. The most common technologies involve some form of scanning, which is generally non-invasive, minimizing the possibilities of damaging the object.

An early form of scanning resembled the work of scribes who visually copied documents, except that a human would type the text of a document into a word processor document. Yes, it was slow, but it provided steady paid work for some graduate students. The more modern method is to simply scan documents using a photocopier, fax machine or dedicated scanning machine. Advanced scanning machines can automatically page through an entire book, although this is not recommended for rare or fragile works. Some advanced scanning stations feature very high quality cameras.

Desk with glass and metal scanner set up in V shape.

Book scanning work station (credit: Jason “Textfiles” Scott. CC BY 2.0)

Then the digital image file resulting from the scan is often run through software that recognizes characters (optical character recognition or OCR) and can separate the images into their own files. OCR works pretty well for clean text of common fonts of modern English, but extra steps will probably be required for nearly anything else.

3D scanning and printing can be used for 3D objects, with the caveat that most 3D scanners are not very large.

Resources, Platforms & Services

Archivematica. A web- and standards-based, open-source application which strives to allow institutions to preserve long-term access to trustworthy, authentic and reliable digital content.
Axaem. A records life-cycle management system for records managers and archivists.
CONTENTdm. A tool to build and showcase digital collections on personalized websites.
JHOVE. A file format identification, validation and characterization tool, useful for files such as PDFs.
LOCKSS. Services and open-source technologies for high-confidence, resilient, secure digital preservation. Strives to provide a reliable mechanism for long-term digital integrity assurance and access.
Omeka. Software for sharing digital collections and creating media-rich online exhibits.
Portico. Strives to provide libraries and publishers with reliable preservation of electronic resources, and expertise and technical assistance to national libraries, so that their content will be accessible to researchers, scholars, and students in the future.
Rosetta. Digital asset management and preservation solution for libraries, archives, museums and other institutions.
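
Several of the services above stress long-term integrity of digital content. One basic building block for protecting an archive is a checksum manifest: record a cryptographic hash of every file, then recompute the hashes later to detect silent changes or corruption. The following is a minimal sketch in Python using SHA-256. The function names are made up for this illustration, and this is not a description of how any of the listed products work internally.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder) -> dict:
    """Map each file's path (relative to `folder`) to its SHA-256 digest."""
    root = Path(folder)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def changed_files(folder, manifest: dict) -> list:
    """List files that were added, removed or altered since `manifest` was made."""
    current = build_manifest(folder)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Usage (the folder name is hypothetical):
# manifest = build_manifest("my_archive")
# ... months later ...
# print(changed_files("my_archive", manifest))
```

The manifest itself should be stored separately from the archive, ideally in more than one place, so that damage to the archive does not also damage the record used to check it.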
