Level Up: Perl and Workflows

By Mark Ciotola

First published on February 15, 2020. Last updated on June 12, 2024.

Level Up

Students will write a brief Perl program to parse a sample document.

Here, parsing a block of text means finding a group of one or more characters and then flagging or changing that group. This is an important skill in both literature research and professional programming.
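As a minimal sketch of what flagging and changing can look like in Perl, the example below runs two substitutions over one sentence. The sample sentence, the "[FOUND: ...]" marker, and the replacement "J. Doe" are all invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample text, invented for illustration.
my $text = "Jean Doe wrote to the mayor. The mayor replied to Jean Doe.";

# Flag: copy the text, then wrap every occurrence in a visible marker.
(my $flagged = $text) =~ s/(Jean Doe)/[FOUND: $1]/g;

# Change: copy the text, then replace every occurrence with another string.
(my $changed = $text) =~ s/Jean Doe/J. Doe/g;

print "$flagged\n";
print "$changed\n";
```

The original $text is left untouched because each substitution works on a copy.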

It is easy enough to search for a short group of characters (also called a “string”) in a single word processing document, since most word processors have a built-in find function. However, sometimes you will need to do a more complicated search or efficiently go through many documents.

Let’s examine an example. Say you were looking for all references to the name Jean Doe in a large collection of digitized letters and public records. Here is a simple way to do it:

Open a document.
Search for “Jean Doe”, and mark the position of each find.
Close the document.
Repeat until all documents have been searched.
Export a report of all found instances.
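The steps above can be sketched in Perl roughly as follows. This is only one possible arrangement, not a prescribed solution: the subroutine name find_in_files is hypothetical, the documents are assumed to be plain text files named on the command line, and a "position" is assumed to mean a line number and a column.

```perl
use strict;
use warnings;

# Search each file for an exact string and return one record per find.
# Each record is [file name, line number, column], with columns counted from 1.
sub find_in_files {
    my ($target, @files) = @_;
    my @hits;
    return @hits unless length $target;    # nothing to search for

    for my $file (@files) {
        # Open a document.
        my $fh;
        unless (open($fh, '<', $file)) {
            warn "Cannot open $file: $!\n";
            next;
        }

        # Search for the target, and mark the position of each find.
        while (my $line = <$fh>) {
            my $start = 0;
            while ((my $pos = index($line, $target, $start)) >= 0) {
                push @hits, [$file, $., $pos + 1];
                $start = $pos + length $target;
            }
        }

        # Close the document.
        close($fh);
    }    # Repeat until all documents have been searched.

    return @hits;
}

# Export a report of all found instances.
my @hits = find_in_files('Jean Doe', @ARGV);
printf "%s: line %d, column %d\n", @$_ for @hits;
printf "%d instance(s) found\n", scalar @hits;
```

Saved under a file name of your choice, say find_name.pl, it could be run as: perl find_name.pl letter1.txt letter2.txt. Those file names are placeholders. A name that is split across two lines of a document would be missed by this line-by-line approach.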

Easy enough, kind of. Except that names often get misspelled or translated.

Jean could be spelled as “Gene” or translated as “Jeanne” or “John”, so you might have to search for those and similar cases as well. There might also be spaces or hyphenations in the middle of the name, so you might have to search for “Je an” too. Or what if Jean Doe appears only as a stylized signature? Then you might have to run an image recognition search. You can only do so much, and the importance of what you need to find and your available resources will dictate your level of effort. This is certainly not an exact science!
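Spelling variants and stray spaces or hyphens are the kind of problem a Perl regular expression can handle. The sketch below is one possible approach, with assumptions: the helper name loose is hypothetical, the variant list contains only the spellings mentioned above, and the sample strings are invented. A pattern this permissive can also produce false matches, so the results still need a human check.

```perl
use strict;
use warnings;

# Spelling variants mentioned in the text; extend the list for your own sources.
my @first_names = ('Jean', 'Gene', 'Jeanne', 'John');

# Turn a name into a pattern that tolerates stray spaces or hyphens
# between its letters, so "Je an" and "Je-an" still match "Jean".
sub loose {
    my ($name) = @_;
    return join '[-\s]*', map { quotemeta($_) } split //, $name;
}

my $first = join '|', map { loose($_) } @first_names;
my $last  = loose('Doe');

# Any first-name variant, at least one space or hyphen, then the surname.
# The /i flag ignores upper and lower case.
my $name_re = qr/\b(?:$first)[-\s]+$last\b/i;

# Invented sample strings, for illustration only.
for my $sample ('Jean Doe', 'Gene Doe', 'Je an Doe', 'Jeanne-Doe', 'Jean Smith') {
    printf "%-12s %s\n", $sample, ($sample =~ $name_re ? 'match' : 'no match');
}
```

Only the last sample, “Jean Smith”, fails to match. A stylized signature is out of reach for this kind of text search, as noted above.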

Perl is an older computer language, but it is good for searching for patterns in text. Use the short course Perl Programming Language to become familiar with Perl basics.

Digital History

Content is copyright the author. Layout is copyright Mark Ciotola. See Corsbook.com for further notices.