Journal of Library Metadata, 10:136?155, 2010
Copyright ? Taylor & Francis Group, LLC
ISSN: 1938-6389 print / 1937-5034 online
DOI: 10.1080/19386389.2010.506400
The Biodiversity Heritage Library:
Advancing Metadata Practices
in a Collaborative Digital Library
SUZANNE C. PILSK
Smithsonian Institution Libraries, Washington, DC, USA
MATTHEW A. PERSON
MBLWHOI Library, Marine Biological Laboratory, Woods Hole, Massachusetts, USA
JOSEPH M. DEVEER
Ernst Mayr Library, Museum of Comparative Zoology, Harvard University, Cambridge,
Massachusetts, USA
JOHN F. FURFEY
MBLWHOI Library, Marine Biological Laboratory, Woods Hole, Massachusetts, USA
MARTIN R. KALFATOVIC
Smithsonian Institution Libraries, Washington, DC, USA
The Biodiversity Heritage Library is an open access digital library of
taxonomic literature, forming a single point of access to this collec-
tion for use by aworldwide audience of professional taxonomists, as
well as ?citizen scientists.? A successful mass-scanning digitization
program, one that creates functional and findable digital objects,
requires thoughtful metadata work flow that parallels the work flow
of the physical items from shelf to scanner. This article examines the
needs of users of taxonomic literature, specifically in relation to the
transformation of traditional library material to digital form. It de-
tails the issues that arise in determining scanning priorities, avoid-
ing duplication of scanning across the founding 12 natural history
and botanical garden library collections, and the problems related
to the complexity of serials, monographs, and series. Highlighted are
the tools, procedures, and methodology for addressing the details
of a mass-scanning operation. Specifically, keeping a steady flow
of material, creation of page level metadata, and building services
on top of data and metadata that meet the needs of the targeted
Address correspondence to Suzanne C. Pilsk, Smithsonian Institution Libraries, P.O. Box
37012, MRC 154, Washington, DC 20013-7012, USA. E-mail: pilsks@si.edu
136
Biodiversity Heritage Library 137
communities. The replication of the BHL model across a number of
related projects in China, Brazil, and Australia are documented as
evidence of the success of the BHL mass-scanning project plan.
KEYWORDS Biodiversity Heritage Library, taxonomic literature,
digital libraries, digitization projects, digitization workflow, mass-
scanning projects, collaboration, natural history libraries
?In any well-appointed Natural History Library there should be found
every book and every edition of every book dealing in the remotest way
with the subjects concerned? (Sherborn, 1932).
BIRTH OF THE BIODIVERSITY HERITAGE LIBRARY
The early 2000s saw the enthusiastic embracement of developing digital Web
technologies across various fields of study and methods of research. New
approaches to traditional work were attracting research institutions, natural
history museums, taxonomists, and libraries. This was the beginning of bridg-
ing across silos of information and closed communities of practice to create
integrated communities of knowledge. Natural history libraries saw this as
an opportunity to explore how to better support the needs of taxonomists,
nomenclaturists, and the general species-identifying community. In 2005 a
meeting was held at the Natural History Museum in London, referred to
by participants as LibLab: Library and Laboratory; the Marriage of Research,
Data and Taxonomic Literature. This meeting helped to elucidate a pend-
ing ?perfect storm,? defined by the confluence of reasonable scanning costs,
significant scannable library collections, and a geographically dispersed and
demanding user community. The outcome was clear; there was a need to
move toward the implementation of a global digital library project.
At the time it appeared to be a novel idea, but it was clear that scientists
around the world would use a digital library of taxonomic literature. Borrow-
ing a term from another scientific field, the half-life of taxonomic literature
is longer than that of any other scientific arena (Moritz, 2005). In most other
sciences, the immediate need is for current publications revealing the most
recent discoveries and the most salient laboratory data. In taxonomy, it is the
historical literature that is critical for the identifying and naming of species.
The state of the art of scanning and serving literature via the Internet was
mature enough to handle the needs of this community. Old, rare, and re-
quired literature could be provided electronically wherever and whenever
the researcher needed it.
To move forward with the successes discussed at LibLab, a partnering
group formed naturally from participants and produced a memorandum of
understanding establishing the Biodiversity Heritage Library (BHL). Initial
library members included the American Museum of Natural History (New
138 S. C. Pilsk et al.
York, NY), Harvard University Botany Libraries (Cambridge, MA); the Ernst
Mayr Library of the Museum of Comparative Zoology, Harvard University
(Cambridge, MA); MBLWHOI Library of the Marine Biological Laboratory
and Woods Hole Oceanographic Institution, (Woods Hole, MA); Missouri
Botanical Garden (St. Louis, MO); the Natural History Museum (London);
New York Botanical Garden, LuEsther T. Mertz Library (New York, NY); the
Royal Botanic Gardens, Kew (London); and Smithsonian Institution Libraries
(Washington, D.C.). In May 2009, two additional members, the Academy of
Natural Sciences (Philadelphia, PA), and the California Academy of Sciences
(San Francisco, CA) joined the consortium. Funding was made available
through a grant from the MacArthur Foundation (via the Encyclopedia of
Life http://eol.org/) to begin digitization with the Internet Archive as the
BHL scanning partner.
The mass-scanning project was launched via easy decisions stemming
from a clear focus on biodiversity literature, combined with a realization
of the need to work out challenging workflows. Librarians at the partner li-
braries clearly defined the scope of the materials to be scanned: anything out
of copyright that was critical to the work of their clientele. The challenges
that emerged were identifying the specific titles, avoiding redundant work,
and, if possible, avoiding duplication of scanning. To achieve economies of
scale, the three mass-scanning centers (located in New York City, Washing-
ton, DC, and Boston) established by Internet Archive demanded a sizable
quantity of material. The ?mass? scanning approach insisted on an effective
and efficient work flow to process the quantity required by the Internet
Archive ?beasts.? ?Feeding the beast? became a catch phrase to each BHL
member library that described the need to keep the material flowing to the
scanning centers. The operation needed to run continuously and the quality
of the finished digital product had to be acceptable.
As much a sociology project as a scanning project, BHL gave rise to a
multi-institutional team of international librarians that grew organically from
the partnering libraries in an extremely positive and collaborative atmo-
sphere. In other words, a major asset has been the collegiality and agree-
ment among team members on the fundamental direction of the project. The
librarians on the frontline have worked out effective ways to communicate,
share expertise, and lend helping hands to one another for the good of the
project and to ensure a successful initiative. This paper is indicative of the
cooperative spirit underlying the BHL. In fact, the American Library Associ-
ation is recognizing the collaborative nature of the project by awarding the
2010 ALCTS Outstanding Collaboration Citation to BHL.
TAXONOMISTS AND LIBRARIES
Natural history libraries have historically been in a unique position to sup-
port the work of taxonomic nomenclature due to the requirements of the
Biodiversity Heritage Library 139
field. In all areas of systematic taxonomy, scientists identifying and nam-
ing species follow specific rules and guidelines based on a system created
by the Father of Taxonomy, Carl von Linne`, a.k.a Linnaeus. During a re-
cent broadcast of the popular American television show Jeopardy! a final
question referred to the mnemonic that most students use to remember the
Linnaeus? outline: ?Kings Play Chess On Fine Grain Sands? equals Kingdom,
Phylum, Class, Order, Family, Genus, Species. The Linnaen system of clas-
sification is the art of naming species resulting in a system that is orderly,
not overly redundant, attempts to avoid duplications and synonyms for the
same species, and can be communicated across disciplines, languages, and
oceans. Each taxon community has its own rules for its specific area of ex-
pertise (e.g., botanists and zoologists each have unique sets of rules, and
specific taxon subgroups will choose a specific classification scheme). A ma-
jor element in all the rules agreed upon by the nomenclaturists who are iden-
tifying, naming, or revising species?whether a bug or a sprout, endoskele-
tal, exoskeletal, or invertebrate?currently involves the printed published
literature.
The preamble of the International Commission on Zoological Nomen-
clature?s International Code1 states: ?The objects of the Code are to promote
stability and universality in the scientific names of animals and to ensure that
the name of each taxon is unique and distinct.? The code specifies that ?the
name or nomenclatural act must have been published? (Article 11.1) and the
publication must be freely available to the public. The rules continue with
specifically mentioning libraries as holders of the published records.2
Natural history libraries are critical for preserving and maintaining taxo-
nomic information. Libraries are mandated by these rules to store the publi-
cations that name species as they are discovered. As curators of these library
collections, librarians are obligated to have freely available copies for all
scientists doing research on species to review and discover what has been
found, documented, and named. In general, larger and well-funded nat-
ural history libraries have been hosting natural history publications since
the 1700s. Since that time, researchers have traveled to these collections or
borrowed from them to conduct their work. Systematic taxonomy publica-
tions require species citations from the published works. In other words,
the published literature is critical to discovery, revising, and naming of
life.
Usually found in the form of a ?Linnaean binomial,? the Latin-modeled
name is used to name the organism. A complete taxonomic citation includes
the original scientist?s name (the ?author? of the species) and date of dis-
covery. Typical naming is familiar to most: ?Homo sapiens? (Linnaeus, 1758)
translates to genus Homo and species sapiens, as named by Linnaeus in his
published work of 1758. The 1758 edition of Linnaeus? major work Systema
Naturae, was the first complete edition and is generally considered to be the
starting point for taxonomic nomenclature.3
140 S. C. Pilsk et al.
PROBLEMS IN SOLVING REAL NEEDS OF TAXONOMISTS
Over the centuries, scientific nomenclature specialists have been forming
species citations using abbreviations and notes that accommodate their spe-
cial fields of study. These practices developed in an isolated manner specific
to discipline (or subdiscipline) and independent of related species projects
or the expertise of librarians and information professionals. Concurrently,
librarians developing and implementing metadata policies and procedures
never entered into conversation with taxonomists. Librarians did little to dis-
cover how taxonomic citations are formed and what the actual access needs
of the scientists were. As a result, each group adopted its own way of pro-
cessing information, and constant translation between the two worlds was
necessary. At best, a clumsy, but acceptable, disconnected reliance existed
between these two fields.
On the scientific researcher side of things, species citations define an
?author? of a species as the scientist that identified and named the living
thing, not the authors of the book that holds the citation. Abbreviations
are used throughout taxonomic citations. The example of Homo sapiens was
given above. Another is the taxonomic classification of the standard goldfish:
?Carassius auratus (Linnaeus, 1758).? Another example of a typical descrip-
tive citation is that for the dolphin found on page 133 of the Catalogue of the
specimens of Mammalia in the collection of the British Museum, published in
1850 by the British Museum Department of Zoology:
Delphinus albimanus, J. Peale, U.S. Exp. Exped. 33 (t./.f. Lined.). Snout,
head, back, tail and dorsal fin blue-black; belly and pectoral fin white;
sides pale tawny; eyes small, brown, and surrounded with a black ring,
which joins the black of the snout; body between the dorsal fin and tail
very much compressed (Gray, 1850).
In the above taxonomic citation, J. Peale is the person who named or au-
thored the scientific binomial Delphinus albimanus, publishing his work
in the U.S. Exploration Expedition. The above citations raise a series of
questions that impact identification and understandability and the ability to
resolve to the proper book, volume, and page. Obvious problems arise from
the lack of standard bibliographic description as expected by librarians, in-
cluding the main entry or access point.
The abbreviations of the titles, a single date, and an author that might
not be considered even traceable in traditional library cataloging of a mono-
graphic series frequently lead these citations to dead ends in the traditional
intergrated library system. A trained librarian can read and translate, yet a
computer resolving to a digitized book fails. Traditional cataloging lacks the
access points of the commonly used bibliographic short or brief title. The
ISBD punctuation does not translate to the standards used by the taxonomist.
Biodiversity Heritage Library 141
It seems that the library community is unaware of the frustration some
scientists experience in trying to decipher library metadata. As the BHL
project began initial planning, informal interviews with potential users were
conducted. A surprising conversation with a botanist was revealing. Although
she fully supports libraries and her home institution research library in par-
ticular, she relies primarily upon basic botanical reference tools to find what
she really needs when doing her research. She was sympathetic to the li-
brary?s need to place a physical object in one place on the shelf and have
it retrievable by basic access points, but the amount of detail she requires is
not contained in those access points. Furthermore, the data that is exposed
is in a format that needs cross-walking to botanical metadata.
Librarians not familiar with a given discipline were surprised to learn the
number of abbreviations and amount of assumptions made in the citations
that the specialist understands and translates quickly. This makes for difficult
computer-to-computer resolving within this new digital world. The mass
scanning of the literature solves the issues related to getting the material out
to the community, but the overwhelming challenge of translating between
century-old systems of notations is not as forthcoming.
MASS SCANNING WORK FLOW AND METADATA ACHIEVEMENTS
Funding requirements created deadlines and specific deliverables expected
of the BHL. The funds from the Encyclopedia of Life4 ($50 million, 10-year
landmark project to develop one Web page per known species) necessitated
the need to start scanning swiftly during the Fall of 2007. Initially the BHL
partners decided to forego a formal analysis of our subject-specific collections
data. However, the OCLC Worldcat Collection Analysis tool was used for 1
year to help the individual administrative staffs of the BHL partners make
subject-specific decisions, support granting requests, and gain a coherent
overview of collection strengths. The results did not, in a practical sense, get
the books to the scanners.
The mass-scanning assembly-line?like work flow initiated a need to
solve some basic bibliographic communication issues that relied on the meta-
data used by the partnering libraries. Some important lessons that developed
out of the Biodiversity Heritage Library (BHL) project were in an attempt to
combine metadata elements to drive the scanning and describe the result-
ing titles, volumes, and pages accurately. BHL library staff members met in
person, on the telephone, via email, and through wiki entries, discussing
these issues to resolution. They pooled their strengths and expertise and
developed critically helpful tools (detailed below), met with Internet Archive
programming staff, and put forth workable solutions to begin the initial and
critically successful scanning that took place during the first 2 years of the
project (2007?2009).
142 S. C. Pilsk et al.
FIGURE 1 View of serials bid list from 2007 (Diacritic display issues have since been
resolved).
SERIAL SELECTIONS
The Natural History Museum in London stepped up to the cricket batting
box and worked on a program using CakePHP to construct a tool that
would enable development of an effective serials scanning work flow.5 The
Serials Bidding List tool (initially known as the Serials Mashup and Union
List) was constructed (see Figure 1), and all BHL partners populated the tool
by providing MARC dumps of their serials records and holding statements.
Out of an initial file of 119,377 records, matching was performed using
OCLC numbers, ISSNs, and title (245), out of which 70,764 unique titles
were identified. This union list of serials was then viewable and editable by
all BHL members.
The Serial Bidding List gives clear indication of the library claiming a
title they intend to scan, as well as volumes and years. The tool and the
agreed upon procedures allow for subsequent editing of the bid to reflect
which titles, volumes, and years actually did get scanned. It also allows other
libraries to bid on the same title, scanning volumes that were not completed
by the initial partner. Manual deduping and merging to consolidate multiple
institution records for single titles is performed on an ?as the title is touched?
basis.
The scanning centers run by the Internet Archive have multiple scanning
stations and require the constant through-put of material. Serial titles with the
volumes and extensive runs provided the proper ?food? to keep the scanning
Biodiversity Heritage Library 143
?beast? in operation. In the natural sciences, with such a long history of
publishing, there are many serial title runs that take up significant library-
shelf real estate in most museum and botanical garden library collections.
It was essential that the BHL partners identify serial runs that could quell
the beast?s appetite. The Serials Bidding list enabled a successful start to this
project.
MONOGRAPH SELECTIONS
The MBLWHOI Library informatics team at the Marine Biological Laboratory
and Woods Hole Oceanographic Institution developed an in-house mono-
graphic analysis tool. Even OCLC, which hosts arguably the most extensive
metadata collection and has a research staff targeting library tool develop-
ment, will say that machine de-duplication is not 100% successful. The BHL
Monographic Deduper is not without its flaws, but it allows libraries to begin
to select monographic titles to get them to the scanner.
The BHL Monographic Deduper is a Web application developed using
the Ruby on Rails open source Web framework. Designed with the workflow
of BHL librarians in mind, this tool makes use of the packlists created to
accompany each shipment of items to the scanning centers. Each BHL library
has its own workspace in the application to which packlists (in .xls format)
can be uploaded. After discussion, the BHL librarians agreed to standardize
the packlists to contain the following column headers: local ID, OCLC, title,
author, volume, chronology, call number, publisher, and publisher place.
While other columns may exist in the packlist, the standardized headers must
exist for the tool?s ingest process to work correctly. Upon upload, a packlist is
parsed and all records are added to the database and a new entry is recorded
in that library?s work space containing the name of the packlist, the upload
date and time, and options to view duplicates, show the entire packlist, or
delete the packlist. Viewing duplicates for a given packlist will initiate five
SQL queries against the entire database of previously uploaded packlists.
Possible duplicate items are displayed in the following five buckets:
? Duplicates by OCLC and Volume
? Duplicates by OCLC only
? Duplicates by Title and Author
? Duplicates by Title and Chronology
? Duplicates by Title only
For each possible duplicate record, the option to view the suggested dupli-
cate and its scanning library is given (see Figure 2). If a suggested duplicate is
determined by a librarian to be a valid duplicate, that record can be deleted
from the current packlist right from the Deduper?s user interface. Once the
144 S. C. Pilsk et al.
FIGURE 2 Results screen in Monographic Deduper.
list of suggested duplicates has been reviewed by a librarian and valid du-
plicates have been removed, an updated packlist can be downloaded in .csv
format. As with most of the coding work done for BHL, the Monographic
Deduper is made available as open source.6
But even with these tools in hand, it was the partnership and collab-
orative spirit that truly got all the BHL partners moving to get books to a
scanning center. Common collegial communication became integrally linked
with the use of the above serials and monograph tools, and subject selec-
tion took place: MBLWHOI Library would do sea creatures, NHM London
would do general serial runs, AMNH would begin with birds, MCZ Ernst Mayr
Library would begin with amphibians and reptiles, and the Smithsonian Insti-
tution libraries would begin with entomology. The botanical libraries worked
out the areas of strengths and committed to scanning their host institution
publications.
METADATA CHALLENGES
Serials are an interesting metadata challenge. It?s well known that librarians
either fall in love with the complexity of serials title changes, merges, splits,
prediction patterns, publisher changes, societies; or they are annoyed by
these changes given the work required to update catalogs. It became appar-
ent that a number of BHL library staff actually have a love-hate relationship
with serials: we love that they fill up a shipment to the scanning ?beast? and
have the complexity of coverage for the BHL until we hit one in a foreign
Biodiversity Heritage Library 145
language that spans such a long time period as to have war interruptions,
topic changes, and title changes that flip back and forth. What was surprising
to most of the BHL staffers though, was how little is known about serials
outside of the library world.
BHL staff met for a targeted meeting with Internet Archive program-
mers to explain the concept of volumes and issues, series and serials, and
the idea of separate libraries binding the same title into different packets.
Terminology used by various staff doing various aspects of a mass-scanning
project was quite enlightening. Team members saw this development as a
true sociological undertaking. Engineers and programmers think a ?book?
is something that has cardboard on either side??cardboard to cardboard.?
Serials are bound into ?books? and, therefore, the metadata describing the
title of the serial is exactly the same for each of these books. This was seen
as a warning sign by the catalogers participating in the project. With serial
runs of hundreds of volumes, discovery of the proper volume post scanning
is impossible with such a definition. Through these discussions we learned
the lesson that any librarian who wants to try to explain volumes, issues,
and numbers bound over a span of hundred years to someone who thinks
in ?cardboard to cardboard? should never try to convey the details over the
phone. Whiteboards and face-to-face meetings with examples in hand are
the only way to resolve these issues.
Once a clearer understanding of the metadata needs of traditional serials
had been resolved, we tackled the following questions: (1) How do you
transfer the metadata associated with a physical volume to the scanning
center, (2) How is that information ingested into the Internet Archive digital
system, and then (3) How should the data then be transferred, ingested, and
displayed in the not yet fully developed BHL portal?
PACKING LISTS TO SCANNED BOOKS
WonderFetch (BHL would like to trade mark the phrase!) was developed
very quickly by the Internet Archive once the concept was clearly under-
stood. It accommodates the needs of serial holdings and as a bonus it has
built in copyright clearance language. WonderFetch supplements the basic
bibliographic description allowing each ?book? to have the title description,
specific volume, issue number, and date metadata. Though simple (not sim-
plistic), it is extremely important for discovery and proper identification of
serials.7
WonderFetch makes use of a spreadsheet application to compose URLs
for each item in a shipment bound for the Internet Archive scanning center.
Internet Archive engineers devised this as an efficient mechanism to populate
their biblio database with both bibliographic and item-level data from library-
generated packing lists. An Excel (or OpenOffice) spreadsheet packing
146 S. C. Pilsk et al.
list is created by the library for each shipment of materials sent to the
scanning center. The packing list contains data for each item in the ship-
ment. This data typically includes library name, library catalog number (or
some unique number to identify the correct bibliographic record within the
library?s integrated library system), barcode number, volume/issue/part des-
ignation, chronology, call number, title, author, date of shipment, and special
notes or instructions to the scanner (see Figure 3). This data is copied to a
WonderFetch spreadsheet template containing formulas and a selection of
copyright statements. For each item in the shipment, the appropriate copy-
right statement (e.g., ?Not in copyright,? ?Digitized with the permission of
the rights holder,? etc.) is selected, and the bibliographic, item-level data
and copyright information are concatenated to create URLs. By clicking on
the URLs formulated within the spreadsheet, bibliographic data is ?fetched?
from the library catalog via a negotiated Z39.50 connection, and item-level
data (volume, number, date, etc.) are pulled from the spreadsheet. Thus,
for serials items especially, essential metadata is extracted and stored in the
meta.xml files for each volume, issue, or part. Items receive appropriate
description in the BHL portal (see Figure 4).8
FRANKENBOOKS
Frankenbooks is the term used within the BHL refering to the digital ver-
sion of a title that might have pages scanned and ?stitched together? from
different physical books. How to conceptually address the handling of this
situation was debated within the group. The goal of getting the data locked
on the printed page digitized, OCR?ed (i.e., optical character recognition
applied), and made available to the wide user community seemed to sup-
port the idea of getting the title scanned. But the importance of the original
source of the data is critical. As mentioned previously, in the taxonomic
world, nomenclaturists value the date of the printed page. The exact date
that a species is named is extremely important (see the history of T. Rex;
Breithaupt, Southwell, & Matthews, 2005). It was decided that BHL will not
allow for Frankenbooks because it was too important to be able to trace the
digitized page back to the physical page with a specific date. If a title cannot
be scanned because of missing pages, then that physical piece is rejected.
Attempts are made to find another copy of the book at a partner library.
?Frankenserial? runs were deemed acceptable. Remembering the practi-
cality of feeding the beast and the mission of providing access to our users,
gaps in serial runs would have to be filled in by one of the partner li-
braries. Unlike a monographic title, serial runs are extremely difficult to find
as complete sets in one library?s holdings. Due to copyright restrictions and
institutional policy, NH London will scan a title up to and including the year
1860, although if it can, through a due diligence process, obtain clearance
F
IG
U
R
E
3
T
yp
ic
al
p
ac
k
in
g
lis
t
w
it
h
ex
am
p
le
s
o
f
b
ib
lio
gr
ap
h
ic
an
d
it
em
-l
ev
el
d
at
a.
147
148 S. C. Pilsk et al.
FIGURE 4 Serial record in BHL portal showing items with full enumeration and chronology
data. Duplicate volumes indicate contributions from two libraries.
to scan a title, it will do so for titles beyond that date; MCZ Ernst Mayr Li-
brary will not scan beyond 1908 for non-U.S. publications. Therefore, the rest
of the title is scanned by other BHL partners?copyright limitations for the
partners allow scanning up to 1923. Missing volumes or volumes physically
too fragile or otherwise not fit for scanning are requested from partners for
filling in. These procedures have been extremely successful in enabling us to
overcome the various random binding decisions of each of the BHL partners
throughout their history.
FOLDOUTS
Older natural history serial titles include the ?pop up window? of their
day?the foldout. While the BHL scanning partner, the Internet Archive, had
developed a successful book (?cardboard to cardboard?) scanning workflow,
they had a learning curve to tackle with respect to how to deal with older
serials volumes with foldouts. Foldouts range from a simple one-inch folded
extension of a page, to several feet in length and width built out of the
binding of a volume. The Internet Archive scanning stations are designed to
hold a book in a cradle and shoot each open page simultaneously using two
overhead mounted cameras. Without a process in place to scan large scale
foldouts, the BHL began scanning operations holding in reserve every vol-
ume that was published with foldouts. This required keeping track of titles
or specific volumes that contained foldouts so that these volumes could be
retrieved for scanning when the Internet Archive had the facility to do so.
This situation contributed to numerous silos of metadata and, in some cases,
separate shelving for waiting volumes. When Internet Archive developed
Biodiversity Heritage Library 149
a scanning station for foldouts one year into the project, these reserve vol-
umes were then scanned and previously skipped foldouts were digitally
stitched into the originally scanned volumes. Thus, all of the data contained
in the foldouts was accounted for. The monographic tracking system had
to be updated to indicate that these previously ?on hold? items were now
complete. This is a notable example of moving ahead with a project before
all of the essential tools and work flows have been tested and are in place.
The pressure of moving the project forward was what was needed to resolve
this issue.
QUALITY CONTROL
Quality control is significant for the undertaking of any digital initiative.
A number of the main practical issues involved in any scanning project
are quality control of metadata, images, scanned book volumes, and every
nuance involved in the conversion of thousands of books into millions of
scanned pages.
Questions arise, such as: What do you do if the book left your library
to be scanned with one title attached to it on the spreadsheet, and the book
ended up digitized attached to another title? What if the scanned page was
blurry, or words were cut out of the image? These problems were addressed
by the BHL librarians in various targeted ways. One method developed to
examine scans was to take a statistical sampling of volumes scanned and
compare the page by page quality of the scanned pages to the actual in
hand paper copy. At the same time, captured metadata was examined as
well. When metadata diacritics were a problem, this was communicated to
other BHL librarians, and the Internet Archive staff was quickly informed
of the issue. We have since solved the issue to ensure that discovery is not
jeopardized due to metadata corruption. To achieve this coordinated effort, a
number of quality control conferences were held, face to face, via conference
calls, and via a dynamic wiki page interface.
TRACKING PROBLEMS
Initially, errors found by librarians and the public were reported and dis-
cussed via email. The volume of exponentially growing threads on many
issues through this type of communication quickly became overwhelming.
In a solid step forward, borrowing from the technology field, the BHL staff
instituted a Web-based error tracking system for communicating metadata
errors and quality control issues. The Gemini system9 allows either a library
user or a librarian to create a ?ticket,? which is then reviewed by a quality
control librarian who assigns the ticket to the appropriate librarian at the
150 S. C. Pilsk et al.
appropriate institution for issue resolution. The data associated with any is-
sue and its resolution is preserved in an electronic issue resolution system
for future reference. In practice, these types of tools have been in use by the
technology field to track bug problems in programming but had not been
adapted to the world of digital librarianship. This remarkable system is a
technical-human interface tool that works well and will guide the resolution
of many metadata issues for years to come.
POSTSCANNING METADATA WORK
BHL staff, like many catalogers, prefers that metadata be created once and
repurposed as needed??touched once? in essence. But, as documented
above, there are many opportunities in the complex scanning work flow
for errors, omissions, and unintended missing data to occur. In many cases
it is typical that some postscanning editing needs to be performed by staff
members. The BHL portal, through which the library?s scanned content is
viewed, is our only access point to book-associated metadata. As the BHL
was developed using open access technology solutions, it is not as robust as
most high-end integrated library systems. It lacks the traditional search and
editing functionalities a librarian would like to see. The current BHL portal
technical infrastructure was established as a viewing system and inventory
of scans enabling computer-to-computer uses, such as data mining, and con-
nection to services such as the Taxon Name Finder (also discussed in detail
below). Taxon Name Finder, an application developed at the MBLWHOI
Library, mines OCR?ed text to return all the scientific names appearing on
each page of literature in the BHL. This lack of metadata editing and field
controls has been challenging to learn how to resolve.
Described above, a Frankenserial is the result of two or more BHL part-
ners contributing scanned volumes of a single journal title. After scanning,
serial records need to be merged into one metadata record and the volumes
must be sequenced. For example, SIL contributes volumes 1 through 66 of
Tijdschrift voor Entomologie. Volumes 67 through 142 are picked up by MCZ
Ernst Mayr Library and so forth. With each institutional contribution to this
title, a MARC record is added to the portal. In this example, two records
for the same serial title now reside in the database. There should be only
one BHL record with all contributed volumes so that users do not encounter
multiple hits when searching for a title. Administrative editing allows for the
merging of the titles and sequencing of volumes. Quick roll out of new tools
and administrative editing capabilities has allowed the BHL to accommodate
the formerly unconsidered Frankenserials.
Other tweaks and adjustments made quickly to the editing functionalities
included the need to provide proper citation resolving and discovery points.
Occasionally, the host library does not have all the information regarding
Biodiversity Heritage Library 151
the enumeration and chronology of each volume scanned. In such cases the
WonderFetch spreadsheet had incomplete information. Each one of those
volumes that are added to the BHL collection needs attention to indicate
the proper volumes and years associated with the scans. Other typical edits
include adding access points for additional authors that are requested by the
users of BHL but were lacking in the original bibliographic description, page
number (described below), and other types of mergers of metadata records
to point to complete sets of titles.
THE MONOGRAPHIC SERIES PROBLEM
Monographic series present a unique problem. For example, the Field Mu-
seum contributed their publication Fieldiana to the BHL. Librarians at the
Field Museum have analyzed each number (issue) and have thus contributed
to the BHL, a fully analyzed monographic record for each number of the se-
ries. However, the database should also have a serial record for Fieldiana
so that users may search that title and browse all numbers of the series.
Toward this end, BHL programmers added portal editing functionality that
allows librarians to associate a scanned item (e.g., an issue, number, or
volume) with more than one bibliographic record. Using the example of
Fieldiana, a serial record can be uploaded to the portal, after which all
scanned numbers may be associated and displayed with that serial record
while remaining linked also to their original monographic records. Likewise,
volumes originally loaded under a serial title such as Memoirs of the Museum
of Comparative Zoology may be subsequently associated with monographic
records. Associating scans with two or more titles is accomplished manually
and is part of the work flow for monographic series. Thus, volumes of a
monographic series may be discovered by a search for the serial title or for
the individual monographic titles.
PAGINATOR
As a title is scanned by the Internet Archive scanning technician, page num-
bering is asserted. When doing the initial quality review of the scans, the
technician can indicate the beginning page number and have the system
auto generate the pages numbers. With some attention to detail, most of the
traditional printed materials are ?published? live to BHL with page numbers
indicated. But the BHL project has many titles that do not fit into the tradi-
tional mold. Users of the site will notice that the system attempts to determine
whether the page being viewed is text or images, although a page number
is not given. Assigning page numbers within a text is a labor-intensive hu-
man intervention solution. Missouri Botanical Garden program staff have
152 S. C. Pilsk et al.
developed a client application called the Paginator. There is a Web version
as well for those with administrative editing privileges on the BHL portal.
After logging into the administrative portion of the BHL, a scan is chosen
and opened. Each page is then opened and a number is assigned. In the end
of the process, each page has been assigned a literal page number in the se-
quence of scans and the page assertion for resolving citations. For example,
the sixth page in the scan might actually be page iv of the book since initial
scans will be the cover, title page, verso of the tile page, et cetera. An impor-
tant aspect of pagination is correctly asserting page types?another function
of the Paginator. For example, pages may be designated as text, illustra-
tion, issue start, foldout, index, et cetera, assisting navigation and discovery
within a digitized work. This is especially important for taxonomic literature
because investigators are often seeking specific illustrations of organisms.
The Paginator allows free-text description of pages in addition to asserting
page types. Thus, illustrations or foldouts may be described in some detail
if so desired. Again, this is a manual process and thus very labor-intensive.
OPEN URL
BHL has focused development efforts to provide access to and re-
solving of citations. The BHL?s OpenURL query interface is available
at http://www.biodiversitylibrary.org/openurl. Both OpenURL 0.1 and
OpenURL 1.0 queries are supported. The table summarizing the parameters
that are accepted by the OpenURL 0.1 and 1.0 query interfaces is available at
http://www.biodiversitylibrary.org/openurlhelp.aspx. By default, the query
interface will (if possible) redirect to the BHL page containing the citation
described by the query. If more than one possible citation is found, the
query interface redirects to a page from which the appropriate citation can
be selected. There are several additional ways that results from the query
interface can be returned: JSON, XML, and HTML. If results are returned
as JSON, a callback function may also be specified by adding a ?callback?
argument to the query.
TAXONOMIC INTELLIGENCE
BHL uses TaxonFinder,10 a taxonomic intelligence tool developed by collab-
orators at uBio.org,11 to locate and identify scientific species names within
the text of digitized books. This names-based index is an incredibly valu-
able tool for research on organisms and specific genus and species. It is
easily incorporated into external Web sites. A full bibliography of all the
titles contained in BHL that have a specific species named can be gener-
ated by a stable url that launches the species name search. To easily link
Biodiversity Heritage Library 153
into a list of all pages containing a given scientific name, follow the pattern
http://www.biodiversitylibrary.org/name/Scientific name. The example be-
low will provide an up to date result that includes the most recent scans of
material that mention the typical goldfish.
Example: Typical goldfish scientific name Carassius auratus
http://www.biodiversitylibrary.org/name/Carassius auratus
For computer to computer calling the name service has been established on
the BHL portal. The name services are XML-based Web services that can be
invoked via SOAP or HTTP GET/POST requests. Responses can be received
in one of three formats: XML wrapped in a SOAP envelope, XML, or JSON.12
BHL has yet to resolve the need to translate the citation abbreviations
that taxonomists use. With the help of TaxonFinder, BHL exposes to the
users a way to access the specific species. The particularly challenging issue
of resolving journal title abbreviations has yet to be solved.
TOWARD A GLOBAL BHL
This paper provides an overview of the BHL project, the support it offers to
vital work in the taxonomic field, and library metadata challenges and ad-
vancements. The paper explains metadata work flow that parallels the work
flow of the physical items from shelf to scanner and beyond to delivery of
the information to the Web. Tools, including the Monographic Deduper and
the Serial Bid list, were created to deal with the issues surrounding choosing
materials to be scanned. Work flow to enhance the data supplied to the
BHL portal included the development of WonderFetch and the Paginator.
The work conducted so far has been successful because of the collabora-
tive spirit among the team members. We have accomplished the creation
of a body of digital material already used in the daily work of scientists.
The availability of OCR text and computer-to-computer interface with tools
like TaxonFinder has made the data within the literature usable. We have
identified areas to approach next to ensure quality of the collection, ease of
discovery, and reuse of the data locked on the pages of the centuries-old
literature and to provide the metadata to a wider community.
The BHL project has been expanding globally because biodiversity is
an increasingly important topic. There are many converging elements (envi-
ronmental, climatic, biotic, and agricultural) that are turning the often over-
looked and underfunded science of systematic taxonomy into a globally
relevant topic of interest.The quick successes of the BHL in providing access
to vast amounts of taxonomic literature have stimulated others around the
world to contribute to the project. First out of the gate was BHL-Europe in
154 S. C. Pilsk et al.
May 2009. This European Union-funded project consists of 26 institutions
from across the EU. Partnering with BHL ?classic,? BHL-Europe is providing
data storage and additional technical development.13
In late 2009, the Chinese Academy of Sciences and BHL signed an
agreement for BHL-China. BHL-China describes itself thus:
Chinese Biodiversity Heritage Library (BHL-China), the pre-research
project funded by the Biodiversity Committee, Chinese Academy of Sci-
ences, is aiming to, through collaboration with BHL (Biodiversity Heritage
Library) and in conjunction with other institutes (colleges) on biologi-
cal research, jointly build a network platform for BHL-China; through
the comprehensive collection, scanning, extraction of the essential bio-
diversity related literature and the systematical arrangement of the im-
portant biodiversity (early focus on botany) literature, to establish an
easily-searchable and communitive network platform, while to make the
data API compliant and therefore provide documentation data services
to biodiversity (including EOL China nodes, Chinese Virtual Herbarium,
etc.) and other related research fields.14
Late 2009 and early 2010 saw the beginnings of even more globalization
of BHL. An initial meeting was held at the Bibliotheca Alexandrina for an
Arabic-language BHL. In early 2010 organizational meetings were convened
for a BHL node based in Brazil. Discussions are underway with the Atlas of
Living Australia for BHL-Australia.
The work pursued via BHL is forging new groups and is modernizing
the solution discussed at the British Museum in 1847, when Charles Dar-
win and a group of scientific luminaries commented: ?The cultivation of
natural science cannot be efficiently carried on without reference to an ex-
tensive library? (Darwin et al., 1847). Today, Darwin?s ?extensive library? is
an increasingly global virtual library designed by the most forward thinking
librarians, scientists, and informaticians. This library is freely available to re-
searchers around the world, and at the service of those studying life in its
myriad forms.
NOTES
1. International Commission on Zoological Nomenclature?s International Code, http://www.iczn.
org/iczn/index.jsp
2. Article 8 and recommendations in Article 8 of the International Commission on Zoological
Nomenclature?s International Code, http://www.iczn.org/iczn/index.jsp
3. Some fields, such as botany, tend to reference an earlier edition dated 1753.
4. Encyclopedia of Life, http://www.eol.org
5. CakePHP, http://cakephp.org/
6. The open source code for the BHL Monographic Deduper can be found at http://github.com/
woodshole/BHL-dedup
7. Details of WonderFetch and other workflow issues are documented here: http://biodivlib.
wikispaces.com/Workflow
Biodiversity Heritage Library 155
8. A slight variation on the theme of WonderFetch at each BHL-participating library has been
adapted to ensure proper workflow. SQL databases and MARC data dumps have been implemented.
9. Gemini is a product of CounterSoft (http://www.countersoft.com/home.aspx)
10. TaxonFinder is available at http://www.ubio.org/index.php?pagename = xml services
11. uBio.org services are available at http://www.ubio.org/
12. A full description of these services is available at https://docs.google.com/Doc?id=dgvjvvkz
1?5qbm3
13. BHL-Europe, http://www.bhl-europe.eu
14. BHL-China, http://www.bhl-china.org/cms/node/25
REFERENCES
Breithaupt, B. H., Southwell, E. H., & Matthews, N. A. (2005). In celebration of
100 years of tyrannosaurus rex; manospondylus gigas, ornithomimus grandis,
and dynamosaurus imperiosus, the earliest discoveries of tyrannosaurus rex in
the west; geological society of america, 2005 annual meeting. Abstracts with
Programs?Geological Society of America, 37(7), 406.
Darwin, C. R., Murchison, R. J., Buckland, M., Egerton, P. G., Greenough, G. B.,
& Owen, R. (1847). Copy of memorial to the first Lord of the Treasury [J.
Russell], respecting the management of the British museum. Parliamentary
Papers, Accounts and Papers, 24.253 (Paper No. 268), 1?3.
Gray, J. E. (1850). Catalogue of the specimens of mammalia in the collection of the
british museum. London, UK: Trustees of the British Museum of Natural History.
Moritz, T. (2005). Library & laboratory: Civil union??? (slide 26). Unpublished
manuscript. Retrieved from http://barcoding.si.edu/LibraryAndLaboratory/3?
11 Moritz.pdf
Sherborn, C. D. (1932). Index animalium. Cambridge, UK: The Trustees of the
British Museum. Retrieved from http://www.archive.org/details/sil34 02 29