Digging Deeper into Document Repositories Activities
By Mark Ciotola
First published on February 15, 2020. Last updated on February 18, 2020.
- Students will perform queries on actual historical databases.
- Students will learn how to gain enhanced access to online repositories of historical documents and information.
Text parsing and processing
Once you find relevant documents, you might need a faster way to locate content of particular interest than reading everything. Text parsing is a way to search for particular terms or fragments, and it can get much more sophisticated than simply typing a search term. Parsing involves searching, and sometimes changing, text in an automated manner. For example, one may wish to search a collection of ancient documents for a particular person’s name, within a certain period, while omitting documents that mention another person’s name.
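The kind of search described above can be sketched in a few lines of Python. This is only a minimal illustration: the records, names, years, and the function name find_documents are all invented for the example, and a real repository would supply its own data format.

```python
# Invented sample records for illustration only (negative years stand for BCE).
documents = [
    {"id": "doc1", "year": -50, "text": "Letter mentioning Marcus and the harvest."},
    {"id": "doc2", "year": -49, "text": "Marcus writes about Lucius and the army."},
    {"id": "doc3", "year": -20, "text": "A later note that mentions Marcus."},
    {"id": "doc4", "year": -50, "text": "Grain prices in the city."},
]


def find_documents(docs, include, exclude, first_year, last_year):
    """Return ids of documents dated within the period that mention
    `include` but do not mention `exclude`."""
    matches = []
    for doc in docs:
        in_period = first_year <= doc["year"] <= last_year
        text = doc["text"]
        if in_period and include in text and exclude not in text:
            matches.append(doc["id"])
    return matches


# Documents from 60 BCE to 40 BCE that mention Marcus but not Lucius.
result = find_documents(documents, "Marcus", "Lucius", -60, -40)
print(result)  # ['doc1']
```

Note that this sketch matches exact spellings only; variant spellings and capitalization need extra handling.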
Parsing requires the text to be in a computer-readable form. Optical Character Recognition (OCR) software can convert images containing text into searchable text documents. Parsing also requires attention to several details. For example, are there different spellings of that person’s name? Is the capitalization of the name inconsistent? Is that person known by nicknames or abbreviations?
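One possible way to handle variant spellings, inconsistent capitalization, and nicknames is a regular expression that lists the variants and ignores case. The sketch below uses Python's standard re module; the list of variants and the sample sentence are invented for illustration.

```python
import re

# Invented variants of one person's name: alternate spellings and a nickname.
variants = ["Catherine", "Katherine", "Kathryn", "Kate"]

# \b marks a word boundary, so "Kate" does not match inside "skater".
# re.escape guards against punctuation in a variant being read as regex syntax.
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
    re.IGNORECASE,  # treat KATHERINE, Katherine, and katherine alike
)

text = "KATHERINE signed the deed. Her sister called her Kate. The skater left."
found = pattern.findall(text)
print(found)  # ['KATHERINE', 'Kate']
```

The list of variants still has to come from your own knowledge of the sources; the expression only automates the search once you know what to look for.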
There are many tools for parsing. The most common is the simple find, or find & replace, command in word processors, text editors, and many other applications, so you do not necessarily need to write your own program. However, you may have to become skilled at writing search expressions to find exactly what you want.
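For cases where a built-in find & replace command is not flexible enough, the same idea can be sketched with a regular expression in Python. The name, its abbreviation, and the sample text are invented for this example, and the expression is just one way to write such a pattern.

```python
import re

# Invented sample text with an abbreviated, a standard, and an all-caps form.
text = (
    "Jno. Harlow sold the farm. "
    "John Harlow's son objected. "
    "JOHN HARLOW signed anyway."
)

# Match "Jno." or "John", then whitespace, then "Harlow", in any capitalization.
pattern = re.compile(r"\b(?:Jno\.|John)\s+Harlow\b", re.IGNORECASE)

# subn replaces every match and also reports how many replacements were made.
normalized, count = pattern.subn("John Harlow", text)
print(count)
print(normalized)
```

Normalizing every variant to one spelling first can make later searches simpler, but it is safer to keep an untouched copy of the original text, since replacement discards the original wording.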