The Art of Life: Discovering Illustrations in the Biodiversity Heritage Library

Ed Bachta, USA, Charles Moad, USA, William Ulate, USA

The IMA Lab at the Indianapolis Museum of Art is partnering with the Missouri Botanical Garden as part of an NEH grant to liberate natural history illustrations from the digitized books and journals in the online Biodiversity Heritage Library (BHL). These visual resources were produced by fine illustrators including John James Audubon, and support the scientific claims of prominent naturalists such as Charles Darwin. Identification of these images can assist scholars and teachers who use BHL to support their work in the fields of science, art, and history. The goal of this partnership is to develop algorithms that can analyze the scans and metadata resulting from BHL’s collaboration with the Internet Archive to determine which of the millions of pages in the library have illustrations.

BHL contains millions of files representing the pages in the collection, including scans, book metadata, and OCR output. To process this information, the team will use the compute cluster at BHL to run the work in parallel. Algorithms run on the cluster will compute metrics that characterize each page in the BHL and store them in a MongoDB NoSQL database. These metrics will then be analyzed to determine whether each page contains an illustration. This paper will describe the algorithms as well as the findings made during analysis. From these findings we may be able to determine whether the approach can be applied to illustrated texts from other domains, such as art catalogues.
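As a rough illustration of the pipeline described above (compute per-page metrics, store them as documents, then analyze them), the following Python sketch may help. It is not the project's algorithm: the metrics (`ink_ratio`, `chars_per_ink`), the thresholds, and every function name here are hypothetical, chosen only to show how scan data and OCR output could be combined into a record suitable for a document database.

```python
from dataclasses import dataclass, asdict


@dataclass
class PageMetrics:
    """Hypothetical per-page metrics; not the metrics used by the project."""
    page_id: str
    ink_ratio: float      # fraction of dark pixels in the scan
    ocr_chars: int        # non-whitespace characters found by OCR
    chars_per_ink: float  # OCR characters per dark pixel


def compute_metrics(page_id, pixels, ocr_text, dark_threshold=128):
    """Compute metrics from a grayscale scan (rows of 0-255 values) and OCR text."""
    total = sum(len(row) for row in pixels)
    dark = sum(1 for row in pixels for value in row if value < dark_threshold)
    ink_ratio = dark / total if total else 0.0
    ocr_chars = sum(1 for ch in ocr_text if not ch.isspace())
    chars_per_ink = ocr_chars / dark if dark else 0.0
    return PageMetrics(page_id, ink_ratio, ocr_chars, chars_per_ink)


def likely_illustrated(metrics, min_ink=0.05, max_chars_per_ink=0.01):
    """Assumed heuristic: plenty of ink on the page, but little of it read as text."""
    return (metrics.ink_ratio >= min_ink
            and metrics.chars_per_ink <= max_chars_per_ink)


def to_document(metrics):
    """Build a dict shaped like a document that could be stored in MongoDB."""
    doc = asdict(metrics)
    doc["_id"] = doc.pop("page_id")
    doc["likely_illustrated"] = likely_illustrated(metrics)
    return doc
```

Because each page is processed independently, a function like `compute_metrics` could be mapped over pages in parallel on a cluster, and the resulting dictionaries could be written to a MongoDB collection (for example with pymongo's `insert_one`). In practice the thresholds would have to be derived from analysis of real pages rather than fixed in advance.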