ICES Journal of Marine Science (2021), doi:10.1093/icesjms/fsab082
Contribution to the Themed Section: ‘Patterns of Biodiversity of Marine Zooplankton Based
on Molecular Analysis’
The role of taxonomic expertise in interpretation of
metabarcoding studies
Paula Pappalardo 1*, Allen G. Collins 1,2, Katrina M. Pagenkopp Lohan 3, Kate M. Hanson 4,
Sarit B. Truskey 1,5, William Jaeckle 6, Cheryl Lewis Ames 1,7, Jessica A. Goodheart 1,
Stephanie L. Bush 1,8,9, Leann M. Biancani 1, Ellen E. Strong 1, Michael Vecchione 1,2,
M. G. Harasewych 1, Karen Reed 1, Chan Lin 1, Elise C. Hartil 10, Jessica Whelpley11,
Jamie Blumberg 6, Kenan Matterson 12, Niamh E. Redmond 13, Allison Becker 13,
Michael J. Boyle 14,†, and Karen J. Osborn 1,†
1Department of Invertebrate Zoology, Smithsonian National Museum of Natural History, Washington, DC, USA
2National Systematics Laboratory, NOAA’s National Marine Fisheries Service, Smithsonian National Museum of Natural History, Washington, DC, USA
3Marine Disease Ecology Laboratory, Smithsonian Environmental Research Center, Edgewater, MD, USA
4Biology Department, Duke University, Durham, NC, USA
5Marine Science Center, Northeastern University, Nahant, MA, USA
6Biology Department, Illinois Wesleyan University, Bloomington, IL, USA
7International Integrative Research and Instruction, Graduate School of Agricultural Science, Tohoku University, Sendai, Japan
8Monterey Bay Aquarium Research Institute, Moss Landing, Midwater Ecology Lab, CA, USA
9Monterey Bay Aquarium, Animal Husbandry Department, Monterey, CA, USA
10School of Marine Sciences, University of Maine, Orono, ME, USA
11Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine, FL, USA
12Department of Biological, Geological, and Environmental Sciences, University of Bologna, Ravenna, Italy
13Smithsonian Institution Barcode Network, Smithsonian National Museum of Natural History, Washington, DC, USA
14Smithsonian Marine Station, Smithsonian National Museum of Natural History, Fort Pierce, FL, USA
*Corresponding author: e-mail: paulapappalardo@gmail.com.
†The last two authors are co-senior authors.
Pappalardo, P., Collins, A. G., Pagenkopp Lohan, K. M., Hanson, K. M., Truskey, S. B., Jaeckle, W., Ames, C. L., Goodheart, J. A., Bush, S. L.,
Biancani, L. M., Strong, E. E., Vecchione, M., Harasewych, M. G., Reed, K., Lin, C., Hartil, E. C., Whelpley, J., Blumberg, J., Matterson, K.,
Redmond, N. E., Becker, A., Boyle, M. J., and Osborn, K. J. The role of taxonomic expertise in interpretation of metabarcoding studies. – ICES
Journal of Marine Science, doi:10.1093/icesjms/fsab082.
Received 29 December 2020; revised 5 April 2021; accepted 5 April 2021.
The performance of DNA metabarcoding approaches for characterizing biodiversity can be influenced by multiple factors. Here, we used mor-
phological assessment of taxa in zooplankton samples to develop a large barcode database and to assess the congruence of taxonomic identi-
fication with metabarcoding under different conditions. We analysed taxonomic assignment of metabarcoded samples using two genetic
markers (COI, 18S V1–2), two types of clustering into molecular operational taxonomic units (OTUs, ZOTUs), and three methods for taxo-
nomic assignment (RDP Classifier, BLASTn to GenBank, BLASTn to a local barcode database). The local database includes 1042 COI and 1108
18S (SSU) barcode sequences, and we added new high-quality sequences to GenBank for both markers, including 109 contributions at the
VC International Council for the Exploration of the Sea 2021.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/
licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is prop-
erly cited.
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
2 P. Pappalardo et al.
species level. The number of phyla detected and the number of taxa identified to phylum varied between a genetic marker and among the
three methods used for taxonomic assignments. Blasting the metabarcodes to the local database generated multiple unique contributions to
identify OTUs and ZOTUs. We argue that a multi-marker approach combined with taxonomic expertise to develop a curated, vouchered, lo-
cal barcode database increases taxon detection with metabarcoding, and its potential as a tool for zooplankton biodiversity surveys.
Keywords: barcode, biodiversity, Gulf Stream, holoplankton, meroplankton, metabarcode, zooplankton
Introduction Linnaean ranks (Guillou et al., 2012). The Midori database
Recent studies suggest that metabarcoding could eventually re- includes sequences from GenBank with taxonomic information
place traditional morphological methods for biodiversity surveys at the species level for 13 protein-coding genes (including COI)
and ecosystem assessment (Lejzerowicz et al., 2015; Aylagas et al., and two ribosomal RNA (lrRNA and srRNA; Machida et al.,
2016; Elbrecht et al., 2017; Lobo et al., 2017; Carew et al., 2018). 2017). The stringent filtering of GenBank data improves data reli-
By using DNA-identification to analyse a mixture of unidentified ability in Midori, but greatly reduces the representation of known
species, metabarcoding offers the potential to dramatically reduce species diversity (Machida et al., 2017). While not often quanti-
the time and cost of biodiversity surveys while increasing detec- fied, similarly uneven and incomplete representation of extant
tion of taxa that are difficult to identify based on morphology taxa should be expected for other curated databases.
alone (e.g. invasive species, Abad et al., 2016; cryptic species, The identification at any level from population to phylum of
Aylagas et al., 2016; parasites, Pagenkopp Lohan et al., 2016). metabarcoding sequence profiles depends not only on the com-
However, in many environments, it remains to be demonstrated pleteness of the reference database but also on the algorithm
that metabarcoding can perform better than traditional morpho- used. Following Bazinet and Cummings (2012), these algorithms
logical assessments by taxonomic experts. The performance of can be classified as (i) similarity-based (e.g. programs using
metabarcoding approaches is highly dependent on multiple fac- BLAST), (ii) composition-based (e.g. programs using Naive
tors, including how the initial samples are processed (Ransome Bayes Classifier), and (iii) phylogeny-based (e.g. programs using
et al., 2017; Pagenkopp Lohan et al., 2019), the choice of primers maximum likelihood or Bayesian methods). Even within a cate-
(Lobo et al., 2017), PCR profiles (Aylagas et al., 2016; Clarke gory, the performance of different bioinformatics pipelines can
et al., 2017), genetic markers (Pitz et al., 2020), the completeness vary widely when applied to the same dataset (Bazinet and
of the DNA sequence reference database (Lindeque et al., 2013), Cummings, 2012; Pitz et al., 2020). In addition, the resolution
and even the software used for taxonomic assignment (Bazinet and error rate of taxonomic assignments can vary for different ge-
and Cummings, 2012). netic markers (Richardson et al., 2017; Pitz et al., 2020). Opinions
Even though traditional biodiversity surveys based on mor- on which markers work best vary greatly even for the same taxo-
phological assessments are time consuming and depend on the nomic group (Bhadury et al., 2006; Tang et al., 2012). In studies
level of taxonomic expertise available, they may capture relevant targeting taxonomically diverse assemblages, there is agreement
details for population and community analyses (such as sex, life that a multi-marker approach is more effective (Cowart et al.,
history stage, and relative abundance of taxa, Lindeque et al., 2015; Bucklin et al., 2016; Djurhuus et al., 2018), but continued
2013). Most importantly, taxonomic experts can contribute DNA assessment of different methods and markers for taxonomic as-
sequences to reference databases based on confidently identified signment remains critical to enable metabarcoding data for stud-
and well-documented specimens deposited in a vouchering insti- ies of biodiversity and ecological research.
tution. When paired with metabarcoding, this traditional ap- A more fundamental methodological choice in any metabar-
proach can dramatically reduce the number of unidentified (or coding study is to determine how to generate the final set of
misidentified) sequences (Ransome et al., 2017), which in many sequences for taxonomic identification. Sequences are usually
studies, comprise a large part of a community sample (e.g. 93% clustered into operational taxonomic units (OTUs), using a de-
of taxa for the COI marker in Cowart et al., 2015). These uniden- fined similarity threshold (typically 97%) that is likely to repre-
tified sequences may actually represent species that are new to sci- sent different species. But recent studies have suggested that
ence; however, most often they represent known taxa that are not individual unique amplicons (sequences amplified from an envi-
represented in current reference DNA-sequence databases. ronmental DNA sample) make metabarcoding more reproducible
Therefore, experts agree that continued contribution of vouch- and better represent the full biological diversity within environ-
ered sequences to existing reference databases will be a major step mental samples (Callahan et al., 2017; Edgar, 2018). These unique
to fulfil the promise of metabarcoding approaches (Cristescu, sequences have been called amplicon sequence variants (ASVs) or
2014; Cowart et al., 2015; Bucklin et al., 2016; Ransome et al., zero-radius OTUs (ZOTUs) and capture both inter-specific and
2017; Porter and Hajibabaei, 2018). intra-specific variability. While the full impact of this choice on
Several curated databases have been developed in parallel to derived research questions is still being assessed, it affects the
the implementation of metabarcoding methods, emphasizing spe- number of taxa detected (Schenk et al., 2020).
cific taxonomic groups or genetic markers (e.g. BOLD, Our study combined expert taxonomic knowledge with stan-
Ratnasingham and Hebert, 2007; SILVA, Quast et al., 2012; see dard DNA barcoding and examines how metabarcoding results
Table 2 of Porter and Hajibabaei, 2018 for a summary of data- can be improved by using taxonomic expertise to develop a cu-
bases commonly used in metabarcoding studies). PR2 and rated database of vouchered DNA barcodes for multiple genetic
Midori are additional curated databases that were not assessed in markers. We focused on a challenging yet commonly sampled as-
Porter and Hajibabaei (2018). The PR2 database includes rRNA semblage, marine zooplankton communities. Given their broad
sequences with a curated taxonomy structured to conform with taxonomic diversity, their roles as critical members of ocean food
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 3
webs, their many morphologically distinctive early life stages, and MA) for extraction, PCR amplification and Sanger sequencing
the extensive taxonomic expertise required for accurate morpho- (see DNA Barcoding below). All morphological and image
logical identifications, marine invertebrate zooplankton are a nat- vouchers generated by the StreamCode project (Supplementary
ural target for diversity surveys using a metabarcoding approach. Table S1), as well as tissues and DNA aliquots, are available in the
Specifically, we compared the resulting taxonomic identification collections of the National Museum of Natural History
of metabarcoding samples analysed using: (i) different markers, (NMNH).
i.e., either COI or 18S V1–2, (ii) different bioinformatic pipelines, After each of the samples was sequenced for COI and 18S V1–
and (iii) different operational taxonomic units (OTUs and 2 regions, a two-pronged approach was taken to verify and refine
ZOTUs). The different methods were compared among each the initial identifications: (i) BLAST and alignment based and (ii)
other, since the true community composition is unknown. To ac- phylogeny based. Both approaches were run on the Smithsonian
complish these goals, the StreamCode Project (a collaborative ef- Institution’s High Performance Computing Cluster (https://doi.
fort led by researchers in the Smithsonian National Museum of org/10.25572/SIHPC), see Supplementary material for additional
Natural History), made multiple collections of plankton from a details. For the BLAST approach (Altschul et al., 1990), each se-
specific geographic area and developed one of the largest curated quence was compared against sequences in GenBank and personal
zooplankton databases that includes DNA barcodes. We analyse reference libraries to verify taxon assignments. For the phylogeny-
if this regional database can help to identify organisms to higher based approach, we combined target sequences, related sequences
taxonomic levels such as phylum, and also if it can increase the from GenBank, and sequences from personal libraries to build
taxonomic resolution necessary to classify zooplankton into eco- phylogenetic trees and to determine the placement of each target
logical plankton type to identify permanent (holoplankton) or sequence in relation to known sequences. Based on the initial field
temporary (meroplankton) residents of the plankton. IDs, results of the BLASTn searches, the phylogenetic trees, photo-
graphs, and voucher specimens, taxonomic experts assigned a final
Methods ID to each sample (Supplementary Table S1).
StreamCode project plankton collection
Twelve plankton tow samples were collected from the Florida DNA barcoding
Current of the Gulf Stream off the Atlantic coast near Fort Pierce, Detailed sequencing methods, primers used (Supplementary
Florida (27.45N 79.95W) in June and August 2017 between the Table S2), and quality control protocols are provided in the
surface and 77–145 m depth. To maximize the diversity of taxa Supplementary material. DNA extraction, amplification, and se-
recovered, we used both a circular mouth plankton net (mesh quencing were performed in the Smithsonian Laboratories of
size 209mm) and a modified mid-water trawl (1.0 m2 square Analytical Biology (LAB), NMNH. All 18S V1–2 and COI sequen-
mouth net with 500mm mesh). Upon collection and mixing, ap- ces produced in the current study, the StreamCode DNA Barcode
proximately one-quarter of each plankton sample was separated Database, were uploaded to GenBank (https://www.ncbi.nlm.nih.
for metabarcoding analysis. Each tow-specific subsample desig- gov/genbank/, NCBI BioProject PRJNA421480, see Supplementary
nated for the metabarcoding analysis was concentrated by pour- Table S1 for accession numbers).
ing it through a 50mm mesh nylon filter, rinsed with ultracold
95% ethanol into a 50 ml polypropylene tube, stored on dry ice StreamCode DNA barcode database
for transport to the Smithsonian Marine Station at Fort Pierce To identify novel contributions of the StreamCode DNA barcode
(SMSFP), and stored at 80C until processing. The leftover por- database to GenBank, we used the gap analysis tool developed by
tion of the mixed tow sample was diluted in at least 10 l of seawa- the Global Genome Initiative (GGI) for each refined ID
ter, kept aerated and chilled during transport to SMSFP, and used (Supplementary Figure S1) at the rank of species and higher taxo-
for live sorting of individuals. nomic levels (GGI Gap Analysis Tool, 2019; https://www.global
geno.me/gaps/live), GGI tool last accessed on 29 October 2020.
Taxonomic identification using morphology The live gap analysis tool searches GenBank for each name con-
Live plankton were hand-sorted based on taxonomic groups un- sidering only sequences of barcode quality, defined as GenBank
der stereomicroscopes at the SMSFP by a team of taxonomic sequence records for COI, rbcL, matK, and/or ITS with a se-
experts and students. Animals were selectively picked with the quence length > 500 bp that include information identifying a
goal of maximizing the diversity of samples collected from focal scientific name, a voucher specimen, and the amplification
groups across 15 invertebrate phyla (between 1 and 200 individu- primer sequences (GGI Gap Analysis Tool, 2019). From the two
als per focal group, Table 1). For some groups (pteropods, holo- markers we used for this study (COI and 18S), only COI is offi-
planktonic polychaetes), selective sampling was able to capture cially considered a barcode and included in the Gap Analysis
most of the expected species diversity. For species-rich groups Tool; however, we modified this tool with an in-house Python
such as copepods, the sorting effort captured only a fraction of script to search for 18S sequences in GenBank separately for each
the expected species diversity, and experts targeted taxa not yet Phylum.
represented in GenBank. Live specimens were grouped by mor-
photype and classified to the lowest taxonomic level possible. DNA metabarcoding and bioinformatics protocol
Individuals and populations of live specimens were photo- DNA extraction, PCR amplification, and multiplex library prepara-
graphed in the laboratory at SMSFP (see photography details and tion of pooled amplicons were performed at the SMSFP; final
how to access images in the Supplementary material). The high- quantification and library-size determination steps for sequencing
resolution images allowed us to verify the initial identifications protocols were performed by staff at LAB. Raw metabarcoding
made in the field. Tissue samples or whole animals were placed data is available through NCBI Sequence Read Archive and can be
directly in 150ml TD-M2 tissue buffer (Autogen, Inc., Hollister, found in NCBI BioProject PRJNA421480 (https://www.ncbi.nlm.
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
4 P. Pappalardo et al.
Table 1. Barcoding success.
Number of Total number COI 18S % success % success % success
Phylum Target group morphospecies of individuals barcodes barcodes Individuals barcoded for both COI 18S both
Annelida Class Polychaeta 57 160 107 115 93 66.88 71.88 58.13
Arthropoda Subclass Copepoda 51 95 77 95 77 81.05 100 81.05
Suborder Hyperiidea 60 202 177 132 122 87.62 65.35 60.4
Miscelaneous arthropods 78 130 108 116 103 83.08 89.23 79.23
Class Ostracoda 18 91 53 70 41 58.24 76.92 45.05
Brachiopoda Phylum Brachiopoda 2 12 12 12 12 100 100 100
Bryozoa Phylum Bryozoa 3 7 6 7 6 85.71 100 85.71
Chaetognatha Phylum Chaetognatha 10 43 19 12 8 44.19 27.91 18.6
Chordata Class Appendicularia 7 24 5 16 4 20.83 66.67 16.67
Chordata Class Thaliacea 7 17 1 15 1 5.88 88.24 5.88
Cnidaria Class Anthozoa 8 27 20 20 17 74.07 74.07 62.96
Class Hydrozoa 38 103 58 86 57 56.31 83.5 55.34
Class Scyphozoa 4 19 17 17 17 89.47 89.47 89.47
Ctenophora Phylum Ctenophora 3 5 0 5 0 0 100 0
Echinodermata Class Asteroidea 6 30 28 27 25 93.33 90 83.33
Class Echinoidea 5 7 7 7 7 100 100 100
Class Holothuroidea 5 13 12 13 12 92.31 100 92.31
Class Ophiuroidea 11 26 17 23 17 65.38 88.46 65.38
Hemichordata Phylum Hemichordata 6 16 8 6 6 50 37.5 37.5
Mollusca Class Bivalvia 5 10 8 10 8 80 100 80
Class Cephalopoda 10 20 17 15 15 85 75 75
Class Gastropoda 51 96 76 73 66 79.17 76.04 68.75
Order Pteropoda 33 209 177 181 163 84.69 86.6 77.99
Nemertea Phylum Nemertea 1 1 0 0 0 0 0 0
Phoronida Phylum Phoronida 1 1 1 1 1 100 100 100
Sipuncula Phylum Sipuncula 19 35 31 34 31 88.57 97.14 88.57
Total 499 1399 1042 1108
Total number of morphospecies, total number of StreamCode samples collected for each taxonomic group, details of number of samples for each genetic marker that yielded useful sequences, and percentage of sam-
ples that were successfully sequenced for COI and 18S V1–2. Cells in red highlight the groups with less than 75% success. Miscellaneous arthropods include non-hyperiid amphipods, and members of the orders
Cumacea, Decapoda, Euphausiacea, Mysida, and Stomatopoda.
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 5
Table 2. Performance of methods to assign barcodes to phylum using the StreamCode DNA barcode database.
RDP classifier BLASTn to GenBank BLASTn to StreamCode
Marker StreamCode Phylum N barcodes % identified % wrong % identified % wrong % identified % wrong
COI Annelida 107 6.5 14.3 22.4 0 83.2 0
Arthropoda 415 78.8 0 74.2 0 90.1 0
Brachiopoda 12 NA NA NA NA 100 0
Bryozoa 6 NA NA NA NA 100 0
Chaetognatha 19 89.5 5.9 84.2 0 84.2 0
Chordata 6 NA NA NA NA 66.7 0
Cnidaria 95 74.7 0 72.6 0 86.3 0
Echinodermata 64 89.1 0 64.1 0 84.4 0
Hemichordata 8 12.5 100 NA NA 87.5 0
Mollusca 278 75.2 0 77 0 87.8 0
Phoronida 1 NA NA NA NA NA NA
Sipuncula 31 35.5 0 74.2 0 77.4 0
18S V1–2 Annelida 115 88.7 8.8 99.1 1.8 100 0
Arthropoda 413 87.2 0 96.9 0 99.5 0
Brachiopoda 12 100 0 100 0 100 0
Bryozoa 7 100 0 100 0 100 0
Chaetognatha 12 91.7 0 100 0 100 0
Chordata 31 100 0 100 0 100 0
Cnidaria 123 94.3 0 100 0 100 2.4
Ctenophora 5 100 0 100 0 100 0
Echinodermata 70 87.1 0 100 0 100 0
Hemichordata 6 100 0 100 0 100 0
Mollusca 279 95 0 99.6 0 99.3 0
Phoronida 1 100 0 100 0 100 100
Sipuncula 34 88.2 0 100 0 97.1 0
Total number of barcodes tested, percentage of barcodes identified to phylum, and percentage of barcodes with a wrong assignment for each marker and
StreamCode phylum. To accept an identification to phylum, the confidence threshold was equal to or larger than 0.90 with RDP Classifier, and percent similar-
ity was equal to or higher than 85% for BLASTn. The RDP Classifier was used with the Midori database for COI and with the PR2 database for 18S. NA (not ap-
plicable) are instances where no assignment passed the confidence threshold.
nih.gov/bioproject/?term¼PRJNA421480). Additional details on using BLASTn against the StreamCode DNA barcode database
primers (Supplementary Table S3), sequencing methods and bioin- (we refer to this approach as BLASTn-StreamCode).
formatics pipeline are provided in the Supplementary material. The RDP classifier (Wang et al., 2007) was implemented in
Reads were merged and filtered with USEARCH 10.0 (Edgar, QIIME (Caporaso et al., 2010) for the taxonomic assignment of
2013). Post-merging, the allowable sequence range was 400– the 18S sequences, selecting PR2 11.1 as the reference database
500 bp for 18S V1–2 and 300–400 bp for COI. Primer specific (Guillou et al., 2012, PR2 database includes rRNA sequences with
barcodes were demultiplexed in QIIME (Caporaso et al., 2010). curated taxonomy). For the COI sequences, the RDP Classifier
In USEARCH, primers were removed, and then unique sequences was implemented in the Midori server queried on 3 October 2020
were identified, de-replicated, and sorted by abundance. Because (http://reference-midori.info/server.php; Leray et al., 2018), using
the “species level” clustering threshold likely varies by taxonomic the Midori 2-LONGEST database as the reference (Machida et al.,
group, we took two clustering approaches: (i) clustering sequen- 2017; filtered database of mitochondrial-encoded genes for meta-
ces into operational taxonomic units (OTUs) at 97% similarity zoans). We accepted a phylum classification as correct when the
after removing singletons and (ii) clustering sequences into zero- confidence threshold was equal to or larger than 0.90. Based on
radius operational taxonomic units (ZOTUs), allowing at least the information available in Midori, we observed that using 0.90
four copies, which is recommended for smaller datasets (Edgar, as the confidence threshold for COI yields an error rate less than
2016). The number of metabarcode sequences in each filtering 1% for most phyla commonly found in zooplankton assemblages
step is detailed in Supplementary Table S4. The variability in the (http://reference-midori.info/download.php#; follow the path
number of OTUs per phylum with an alternative OTU filtering Archive/Leave-one-sequence-out_test_1.1/COI). We are not
(keeping only sequences with at least four copies) is presented in aware of similar specific guidelines for 18S using the PR2 data-
Supplementary Table S5. base, so we applied the same confidence threshold to 18S.
For the BLASTn-GenBank approach, we assigned taxonomy to
Assigning taxonomy to the metabarcode sequences the OTUs and ZOTUs by running a BLASTn search with default
We took three approaches to assign taxonomic identities to options on 6th October 2020. We filtered out matches lacking
OTUs and ZOTUs: (i) taxonomic assignment using the RDP specific taxonomic information using custom scripts in R (R
Classifier (Wang et al., 2007) with Midori or PR2 as reference Core Team, 2020, see Data Availability for code). For the
databases for COI and 18S, respectively; (ii) taxonomic assign- BLASTn-StreamCode approach, we created a reference database
ment using BLASTn against the NCBI nt database (we refer to with the StreamCode voucher-based barcodes for each marker
this approach as BLASTn-GenBank); (iii) taxonomic assignment and assigned taxonomy using BLASTn. A match to phylum was
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
6 P. Pappalardo et al.
accepted as correct if the percent similarity and sequence coverage resulting from the use of different genetic markers (COI, 18S
were equal to or larger than 85% (Ransome et al., 2017). At least V1–2), different types of clustering (OTUs, ZOTUs), and differ-
for COI, Ransome et al. (2017) showed that an 85% similarity ent approaches to assigning taxonomy (RDP Classifier with
threshold provides a relatively small error rate (0.7%) when MIDORI/PR2, BLASTn-GenBank, and BLASTn-StreamCode).
assigning sequences to phylum. We calculated the number (and proportion) of OTUs/ZOTUs for
At the beginning of this project, we intended to compare each phylum, genetic marker, and approach for a taxonomic
agreements at various taxonomic levels, but we did not find stan- assignment.
dardized recommendations. Initial explorations with our dataset
showed that we would need to generate specific thresholds of tax- Agreements and unique contributions among methods
onomic assignments for each marker, method, and taxonomic for assignment to phylum
group. For this reason, we focused our main analysis on the
For all the OTUs/ZOTUs, we quantified the agreement between
assignments to phylum.
methods for taxonomic assignment to phylum. We recorded
whether all methods agreed, only two agreed, or zero agreed.
Taxonomic matching between databases
Additionally, we kept records of how many agreements were to a
To standardize the taxonomic names across the morphological
specific phylum (e.g. Annelida) versus agreements in failure to
assignments and the different databases consulted in the metabar-
identify a phylum (e.g. Unidentified).
coding analysis (PR2, MIDORI, and GenBank), we followed the
For the taxonomic groups best studied by the StreamCode tax-
classification in the World Register of Marine Species (WoRMS,
onomic experts (Annelida, Arthropoda, Cnidaria, Mollusca, and
Horton et al., 2020). All names corresponding to marine organ-
Sipuncula), we combined the barcodes and OTUs into a phyloge-
isms were matched using the WoRMS online taxon match tool
¼ netic tree (details in Supplementary material). For the metabarco-(http://www.marinespecies.org/aphia.php?p match), last
des, we highlighted the cases in which two or three methods
accessed in October 2020. For matches to non-marine organisms,
agreed in the assignment and unique cases in which only one
we followed the NCBI taxonomy database (https://www.ncbi.
method was able to assign the sequence to a phylum (we refer to
nlm.nih.gov/taxonomy), last accessed in October 2020. When a
these cases as unique contributions).
phylum assignment of a sequence did not pass the corresponding
thresholds for each method, the phylum was labelled
“Unidentified.” Plankton types
Standardizing names across databases is important because the To investigate a categorical approach where it is necessary to
taxonomic differences between databases are not minor. To give match taxa with an ecological planktonic type, we classified
one example, the PR2 database considers Urochordata to be a OTUs/ZOTUs into meroplankton and holoplankton. Meroplankton
phylum, but the WoRMS classification considers Urochordata to consists of the life-history stages of benthic organisms that live
be subphylum Tunicata (within the Phylum Chordata). There is temporarily in the water column (e.g. eggs, larvae, and adults in
some debate about the placement of Sipuncula as a subgroup of the case of medusozoan cnidarians), whereas holoplankton refers
the phylum Annelida (Struck et al., 2011; Parry et al., 2016); to be to organisms that live permanently within the water column.
consistent in our analysis, we followed the WoRMS classification, Although in some cases all members of a phylum belong to the
as of November 2020, that considers Sipuncula a phylum. In ad- same plankton type (e.g. all Sipuncula were meroplankton), for
dition, there were a few cases in which the minimum taxonomic Annelida, Arthropoda, Chordata, Cnidaria, and Mollusca, we used
level reported in the output (by MIDORI or PR2) did not match customized taxonomic filters that involved taxonomic assignments
the phylum assigned by WoRMS. When that happened, we ran below the level of phylum (i.e. Class, Order, Family). In this analy-
independent BLASTn searches of each OTU/ZOTU and verified sis, we excluded OTUs/ZOTUs assigned as “Unidentified” or as
that the WoRMS assignment was correct. We enumerate those non-target phyla.
cases in the R code provided in the Dryad data package.
Results
Validating methods for a taxonomic assignment using StreamCode project
the StreamCode DNA barcode database The collection, handling, morphological identification, and pho-
To validate the performance of the taxonomic assignment to phy- tographing of zooplankton samples involved approximately
lum for the metabarcoding data, we ran the StreamCode COI and 1800 h of field work by 31 persons, or 58 person-hours. Of the
18S barcodes using the same three methods for a taxonomic as- 2260 specimens collected, representative individuals were cata-
signment detailed for the metabarcoding data: (i) RDP Classifier, logued (n¼ 1529; Supplementary Table S1), tissue sampled
(ii) BLASTn-GenBank, and (iii) BLASTn-StreamCode, without (n¼ 1399), photographed (n¼ 1149, Figure 1), and, when possi-
including the query sequence. As not all the barcodes were suc- ble, a morphological voucher was deposited into the National
cessfully identified at the quality thresholds specified, we calcu- Museum of Natural History collections (n¼ 403). Universal pri-
lated the percentage of barcodes that were identified, and from mers were effective for sequencing individuals of most taxonomic
the barcodes identified, we calculated the percentage of wrong groups, although there was considerable variation in success be-
assignments to phylum. tween the two genetic markers (COI, 18S V1–2) across the differ-
ent target groups (Table 1). The number of morphospecies
Variability in the metabarcoding results identified by the taxonomic experts for each target taxonomic
To examine the variability of taxonomic assignments associated group is reported in Table 1. The refined ID for each sample, cur-
with methodological choices when assessing a zooplankton as- rent classifications following the WoRMS taxonomy, voucher
semblage, we compared sequence assignments to the phylum level identification number (USNM), and GenBank accession numbers
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 7
and 18S within the phyla Echinodermata, Mollusca, and
Sipuncula (these could be related to taxonomic backbone differ-
ences, see Supplementary Figure S1); and for COI within the
phyla Annelida and Chordata (Supplementary Figure S1). We
also contributed COI sequences for 33 families (from 6 phyla)
previously not represented, and 11 for 18S; sequences were con-
tributed for 66 genera (from 8 phyla) previously not represented
for COI, and 33 for 18S; and sequences were contributed for 63
species previously not represented for COI, and 46 for 18S
(Supplementary Figure S1). Overall, the largest number of se-
quence contributions were for the phyla Arthropoda, Cnidaria,
and Mollusca.
Validating taxonomic assignment methods using the
StreamCode DNA barcode database
The percentage of StreamCode barcodes identified to phylum
ranged from 0 to 100%, depending on the genetic marker and the
method used to identify taxa (Table 2). For both markers and
most phyla, the BLASTn-GenBank method identified a larger per-
centage of barcodes than the RDP Classifier, and in most cases at
a smaller or zero error rate (Table 2). Using the local database
(BLASTn-StreamCode) improved the percentage of identified
taxa for most phyla, with the largest improvement observed with
COI for the phylum Annelida (Table 2). Not all the barcodes
identified to a phylum were assigned to the correct phylum; the
percentage of wrong assignments varied with marker and method
but was in general larger for COI than for 18S. All of the
StreamCode phyla were identified using the 18S barcodes, with
only small errors in the assignment to phylum for Annelida and
Cnidaria (Table 2), and a failure to correctly identify Phoronida
using the BLASTn-StreamCode approach. Since there was only
one StreamCode DNA barcode available for the phylum
Figure 1. Examples of StreamCode specimens. (a) Pelagobia sp.
Phoronida, when that sequence was queried there were no other
(Polychaeta), USNM 1450035; (b) Maupasia sp. (Polychaeta), USNM
1450037; (c) Peachia sp. (Cnidaria), USNM 1448832; (d) Cytaeis sp. sequences from the phylum Phoronida for comparison.
(Cnidaria), USNM 1447971; (e) Phronima sp. (Arthropoda), USNM
1450286; (f) Clio recurva (Mollusca), USNM 1448342; (g) Euchirella Variability in metabarcoding results
curticauda (Arthropoda), USNM 1448593; (h) Otoporpa sp. The total number of OTUs/ZOTUs identified to phylum differed
(Cnidaria), USNM 1448490; (i) Copilia sp. (Arthropoda), USNM by marker, and methods used for a taxonomic assignment
1448598; (j) Abralia veranyi (Mollusca), USNM 1447996. (a–d) (Table 3). Regardless of the type of clustering or method for a
Highlight of four StreamCode samples that represent new COI
contributions of barcode quality at the genus level not previously taxonomic assignment, the proportion of unidentified taxa was
represented in GenBank. smaller for 18S V1–2 than COI (18S: 15.3–40.8%; COI: 34.7–
75.0%). In general, the BLASTn-GenBank method returned a
smaller number of unidentified taxa and identified a larger num-
for the samples sequenced successfully are also included in
ber of non-target phyla when compared with the other methods
Supplementary Table S1.
(Table 3).
Phylum detection at the defined taxonomic thresholds varied
Contributions to GenBank by marker, type of clustering, and method for a taxonomic as-
The gap analysis allowed us to identify taxa for which signment. The phyla detected by all metabarcoding method com-
StreamCode contributed new COI or 18S sequences to GenBank binations (and by the morphological analysis) were Annelida,
as of 29 October 2020 (taxon names are detailed in Arthropoda, Bryozoa, Chaetognatha, Chordata, Cnidaria,
Supplementary Figure S1). The COI results were specific to con- Echinodermata, Mollusca, and Sipuncula. Other phyla were
tributions of barcode quality (e.g. Flaccisagitta enflata will count detected only by some combinations: Acanthocephala using 18S
as a new contribution because the available sequence in GenBank V1–2; Porifera using COI with BLASTn-GenBank and by mor-
is shorter than 500 bp), but the 18S results represent any contri- phological analysis; Nematoda using OTUs by BLAST-GenBank
bution for that genetic marker without a predefined quality filter. and RDP Classifier. By definition, the phyla not represented in
For COI, we contributed sequences for three classes that were the StreamCode DNA barcode database (Acanthocephala,
previously not included in GenBank (Appendicularia and Nematoda, Nemertea, and Platyhelminthes) were unidentified us-
Thaliacea within phylum Chordata, and Phascolosomatidea ing the BLASTn-StreamCode approach.
within phylum Sipuncula). No new class-level contributions were The cases in which the only one of the methods used for a tax-
made for 18S. Order level contributions were made for both COI onomic assignment was able to identify a phylum (unique
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
8 P. Pappalardo et al.
Table 3. Summary of metabarcoding results.
Marker Clustering Taxa RDP classifier BLASTn-GenBank BLASTn-StreamCode
COI OTUs Zooplankton 1661 1986 1234
Unidentified 3283 2817 3712
Non-target 2 143 0
ZOTUs Zooplankton 5168 5910 4646
Unidentified 4080 3206 4603
Non-target 1 133 0
18S V1–2 OTUs Zooplankton 1892 1914 2935
Unidentified 1468 1453 663
Non-target 238 231 0
ZOTUs Zooplankton 2252 2570 2744
Unidentified 781 526 707
Non-target 418 355 0
Number of OTUs/ZOTUs identified to target zooplankton phylum, unidentified to phylum, or identified to non-target phylum for each genetic marker, type of
clustering, and method used for taxonomic assignment.
Figure 2. Unique OTUs contributions from each method for a taxonomic assignment (BLASTn-GenBank, BLASTn-StreamCode, RDP
classifier), for (a) COI and (b) 18S V1–2. The RDP Classifier was used with the PR2 database for 18S and MIDORI 2 database for COI. Upper
panels: counts for all the unique contributions to identify OTUs in each phylum. We added a constant of 0.5 when the number of unique
contributions was 1, to be able to represent those values in the logarithmic scale. “Non-target” refers to taxa identified to phyla that do not
belong to the target zooplankton groups. Lower panels: distance tree of all the barcodes and OTUs for Annelida and Arthropoda, coloured
according to the identification method for each genetic marker sequence. When two or three metabarcoding methods agreed in the
assignment to the phylum, we coded it “Agreement.”
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 9
contributions) varied by taxonomic group and genetic marker type, the second most diverse groups were chaetognaths, chor-
(Figure 2). For COI OTUs, the BLASTn-StreamCode method dates, cnidarians, mollusks, or non-target phyla. (Supplementary
generated more unique contributions for Annelida, Bryozoa, Table S6, Figure 3b).
Chaetognatha, and Echinodermata (Figure 2a); whereas for 18S Combining all OTUs and ZOTUs to analyse agreement among
V1–2 OTUs, the BLASTn-StreamCode method generated the methods, we observed that the three methods agreed on the phy-
largest number of unique contributions for taxa within lum assignment in 55% of sequences for 18S and 68% of sequen-
Arthropoda (Figure 2b). BLASTn-GenBank and RDP Classifier ces for COI. Agreements between the two methods were observed
had more unique contributions within Chordata, Arthropoda in 42% of sequences for 18S and 32% of sequences for COI
(for COI) and non-target phyla; BLASTn-GenBank and BLASTn- (Figure 4). The case of no agreement between methods was rare
StreamCode had the largest number of unique contributions for (0.1–3%). However, in many cases, the methods agreed only in
the phylum Cnidaria; and RDP Classifier (for COI) performed their inability to identify a specimen to phylum, e.g., in 53% of
best to identify taxa from the phylum Platyhelminthes (Figure 2). the cases all three methods were unable to identify a phylum for
A distance tree including StreamCode barcodes and OTUs for COI at the specified thresholds (Figure 4). When all three meth-
taxa assigned to Arthropoda and Annelida is presented in ods assigned a specific phylum, 98.9% of 18S V1–2 sequences
Figure 2 (lower panel), similar trees for both markers and other agreed on phylum and 99.8% of COI sequences had matching
phyla are presented in Supplementary Figure S3. The unique con- phylum assignments.
tributions to identify ZOTUs were fairly similar and are presented
in Supplementary Figure S2. Plankton types
Arthropods (predominantly crustaceans) were the most di- After removing unidentified matches or matches to non-target
verse group in our plankton samples for all markers and cluster- phyla, we found that most metabarcoding samples consisted of
ing schemes (Figure 3a). Depending on the marker and clustering holoplankton, in agreement with morphological analyses
Figure 3. Variability in the proportion of OTUs/ZOTUs belonging to each phylum identified using different genetic markers (COI and 18S
V1–2) and different approaches for taxonomic assignment (BLASTn-GenBank, BLASTn-StreamCode, RDP Classifier). (a) All samples
(unidentified samples and samples identified to phylum). (b) Only samples identified to phylum without including the phylum Arthropoda.
The RDP Classifier was used with the PR2 database for 18S and MIDORI database for COI. “Non-target” refers to taxa identified to phyla that
do not belong to the target zooplankton groups, “Unidentified” are the OTUs/ZOTUs without a phylum assignment at the defined
taxonomic thresholds. The more abundant taxa are highlighted with black silhouettes.
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
10 P. Pappalardo et al.
errors in taxonomic assignment (Santoferrara, 2019). To mini-
mize errors and improve taxonomic resolution, the implementa-
tion of metabarcoding methods and compilation of reference
databases should be associated with taxonomic experts (Clarke
et al., 2017; Porter and Hajibabaei, 2018). By integrating tradi-
tional taxonomic assessment, barcoding, and metabarcoding, we
have highlighted how a targeted effort to develop a regional DNA
barcode database can help improve metabarcoding taxonomic
assignments. Moreover, we showed that the chances of correctly
identifying taxa belonging to different marine zooplankton phyla
varied by genetic marker, the type of clustering, and the approach
used for taxonomic assignment.
During the short timeframe of the StreamCode project (one
field season, two collection weeks), a small collaboration of taxo-
nomic experts made a large contribution of barcode quality
sequences to GenBank (1108 for 18S and 1042 for COI belonging
to 15 phyla) as well as images and voucher information
(Supplementary Table S1). Hopefully, this encourages similar
work that targets taxa underrepresented in reference databases.
Local bioblitz efforts (Plumb, 2014, where there is intense biodi-
versity sampling in a small region carried out by experts and vol-
unteers) coupled with barcoding efforts are becoming popular
because they are essential for enhancing reference databases.
Large international efforts such as the Marine Barcode of Life
Figure 4. Frequency of agreement in the identification to phylum project have contributed voucher information and barcode
between the different methods for taxonomic assignment for each sequences to the Barcode Of Life (BOLD) system (Puillandre
of the two genetic markers (18S V1–2 and COI). Colours separate et al., 2012). Barcoding efforts led by non-taxonomists would
the cases in which all methods agree, two methods agree, and zero benefit from including taxonomic experts at the beginning of the
methods agree. For the cases in which two or three methods agree, project, incorporating their expertise for sampling design and col-
the percentage of agreement in the failure to identify a phylum is lection of samples (DeWalt, 2011), and also for subsequent qual-
presented within the bar. ity control checks and determination of unidentified taxa. Both
targeted efforts and large collaborations are critical to improving
(Figure 5). Although found at a comparatively smaller propor- the quality and taxonomic coverage of reference DNA databases,
tion, a meroplankton component was detected across all plankton which in turn will make metabarcoding a more efficient tool for
samples, with a larger proportion detected using morphology and characterizing communities captured in environmental samples.
the BLASTn-StreamCode method in the metabarcoding samples. Developing local DNA sequence libraries helps to overcome
For this analysis, we used the classification to lower taxonomic problems associated with incomplete publically available DNA
levels when available; however, many of the metabarcoding sam- reference databases (and with incorrect entries), one of the cur-
ples lacked enough resolution to classify taxa into holoplankton rent limitations of resolving taxonomic assignment with metabar-
or meroplankton (coded N/A in Figure 5). For each combination coding (Porter and Hajibabaei, 2018). Every new sequence counts
of clustering and genetic marker, the BLASTn-StreamCode because the addition of even a few species to a local reference
method produced the smallest number of taxa unassigned to database can improve metabarcoding taxonomic assignments
plankton type (Figure 5). In Supplementary Figure S4, we detail substantially (Abad et al., 2016); so when a large local barcode
the phyla composition for holoplankton and meroplankton iden- database is used, the improvement in taxonomic identification
tified using all methods. can be immense (e.g. up to 7-fold increase in Ransome et al.,
2017). In our study, by using the StreamCode DNA barcode data-
Discussion base (BLASTn-StreamCode method), we identified more unique
Biodiversity assessments performed with metabarcoding methods taxa to phyla that are likely less represented in reference databases
indicate astonishing levels of diversity in biological communities. (e.g. Annelida, Bryozoa, Chaetognatha; the first two also identi-
Despite the promising potential of metabarcoding methods for fied by Ransome et al., 2017 as under-represented taxa). Working
standardized biodiversity surveys (Baird and Hajibabaei, 2012; toward a comprehensive genetic inventory of marine life will not
Deiner et al., 2017), our study highlights that even identification only reduce the number of unidentified sequences in metabar-
to phylum is not straightforward for understudied ecosystems code analyses, but it will also help focus taxonomic efforts by re-
with poor representation in DNA reference databases. Identifying vealing undescribed species in poorly studied groups.
metabarcoding samples at taxonomic levels below phylum is criti- To avoid the problem of inadequately identified records (e.g.
cal for ecological studies (e.g. biogeography, connectivity, func- “environmental sample” or “uncultured metazoan”), researchers
tional groups, biological indicators, parasitism, invasive species), that apply metabarcoding methods can opt for curated databases
yet, our case study demonstrates how many metabarcoding sam- that include only sequences that have passed a quality check and
ples cannot even be identified to phylum or plankton type. have complete taxonomic information. In some cases, curated
Because bias in taxonomic identification with metabarcoding databases also require sequences to be associated with voucher
propagates to downstream analyses, it is important to minimize specimens (e.g. BOLD, http://barcodinglife.org/), which allows
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 11
Figure 5. Proportion of taxa identified as members of the holoplankton, meroplankton, or others (mixed life histories within a group, or a
benthic group) for the metabarcoding results with two genetic markers (COI, 18S V1–2) and three methods for taxonomic assignment,
combined with the results from the morphological analysis. (a) Metabarcoding-OTUs, (b) metabarcoding-ZOTUs, and (c) morphological
analysis. N/A indicates samples that lacked enough taxonomic resolution to be classified into holoplankton or meroplankton.
updates to identifications as taxonomic information continues to unidentified sequences for 18S V1–2 than for COI, regardless of
grow. Even though curated databases are not comprehensive, the method used for a taxonomic assignment. Other studies using
they are continuously improving and there are many efforts un- COI have also found a large number of unidentified samples
derway to develop curated databases for different taxonomic (Leray and Knowlton, 2015; Ransome et al., 2017). To better re-
groups and genetic markers (EukRef, http://eukref.org/; flect marine biodiversity, some authors recommend a first step
MetaZooGene, https://metazoogene.org/; PR2, https://pr2-data that involves a conserved marker useful for coarse taxonomic
base.org/). Despite the higher reliability of data from curated assignments (e.g. 18S V1–2 or V9), and a second step that targets
databases, for a coarse identification of marine zooplankton taxa a highly variable marker (e.g. COI) to provide finer taxonomic
to phylum, we showed that the BLASTn-GenBank approach out- resolution (Leray and Knowlton, 2016). In our study, there were
performed PR2 and Midori databases in assigning a phylum to also differences between markers in the relative taxonomic com-
most OTUs. This probably happens because PR2 and Midori: position detected (as also observed by Cowart et al., 2015;
(i) lag behind GenBank in incorporating new sequences; (ii) in- Djurhuus et al., 2018), and in the proportion of taxa that we were
clude only sequences identified to genus or species level. Finally, able to classify into plankton type. Both of these sources of varia-
the BLASTn-Streamcode method using our local database out- tion could have large impacts on the interpretation of community
performed GenBank in some groups, likely because of our more comparisons in ecological studies. Regardless of whether different
complete coverage of relevant taxa (e.g. Annelida: Polychaeta). genetic markers can vary in their specificity to detect some taxa,
Additionally, we have observed that the StreamCode DNA bar- or the ability to identify samples with different taxonomic resolu-
code database improves resolution at lower taxonomic levels as tion, multiple markers may still reflect similar trends in spatial di-
well (Pappalardo et al., 2020; ESA Abstract). versity (Pitz et al., 2020).
Importantly, given that we detected some phyla with only one An important consideration when using a multi-marker ap-
genetic marker, we support the idea that a multi-marker ap- proach is to apply a common classification scheme across markers
proach can be more effective in studies with a broader scope and DNA reference databases. In our marine zooplankton-fo-
(Cowart et al., 2015; Bucklin et al., 2016; Leray and Knowlton, cused analyses, we standardized the classification scheme used by
2016; Djurhuus et al., 2018). Both when we tested the Midori (based on NCBI) and the PR2 database (custom classifica-
StreamCode barcodes (Table 2) and when we analysed the meta- tion defined by experts) to that in WoRMS. Even though some
barcoding data (Table 3), we found a smaller proportion of aspects of the current WoRMS classification are still under debate
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
12 P. Pappalardo et al.
(e.g. placement of Sipuncula as a phylum), their large editorial mentioned specific thresholds for different taxonomic levels as a
group reviews newer publications and updates their classification “rough proxy” (order: 85%, family: 90%, genus: 95%, and spe-
when there is enough support. WoRMS also provides a useful cies: 98%), but did not specify how the thresholds were deter-
tool for matching taxon names (https://www.marinespecies.org/ mined. In metabarcoding samples containing taxa from multiple
aphia.php?p¼match) and overall, is the most current and author- phyla, uniform thresholds across different phyla may not be ad-
itative resource for the classification of marine taxa. It would be visable, due to different evolutionary rates in different taxonomic
ideal if genetic data resources (e.g. NCBI, BoLD) could adopt a groups. One example of this variation in COI can be seen in the
global taxonomic system (such as the Catalogue of Life, https:// range of confidence thresholds needed by RDP Classifier to ob-
www.catalogueoflife.org/), provided that the classification hierar- tain a correct assignment for each taxonomic level in different
chy is maintained for all groups with the best data sources. For phyla of marine organisms (results from Midori “leave-one-out-
example, even though the Catalogue of Life follows WoRMS for test”, available in http://reference-midori.info/download.php#).
marine organisms, their online website was not updated with the More research is needed on how to define confidence thresholds
current WoRMS edition at the time of submitting this article for assignment to lower taxonomic levels when using metabar-
(December, 2020). In the meantime, it falls on researchers to ad- coding methods, and how those decisions affect results.
dress taxonomic standards in multi-marker studies. Individual Increasingly, the computational methods being developed for
researchers and research groups could still adopt their own taxon- analysing metabarcode data focus on “micro-scale” variation,
omies, even rankless ones (given the arbitrary nature of Linnean such as strain-level variation (e.g. UNOISE, DADA2). These algo-
ranks) such as PhyloCode (i.e. PhyloCode, http://phylonames. rithms were developed with the primary goal of examining bacte-
org/code/). rial diversity and community structure, as bacteria are not
Our data also indicate that exploring different clustering meth- traditionally categorized as species. In fact, many microbiologists
ods can improve taxon detection. When a phylum was detected explore strain-level variation in bacteria, as it can be used to elu-
only in some of the combinations of clustering, genetic marker, cidate differences in virulence, growth patterns, etc. In contrast,
and method for taxonomic assignment (e.g. Acanthocephala, researchers examining metazoan communities to assess commu-
Porifera, Hemichordata, see Supplementary Table S6), it was usu- nity diversity, identify potential invasive species, and examine tro-
ally represented by a few (sometimes only one) OTUs or ZOTUs. phic-level interactions, are generally more interested in
We found two possible explanations for these differences in phy- examining inter-specific diversity. Thus, extending these “micro-
lum detection. Differences could appear because of the different scale” analytical tools to examination of metazoan sequences for
filtering algorithms used to generate OTUs and ZOTUs. For ex- inter-specific diversity, particularly without somewhat arbitrarily
ample, Schenk et al. (2020) found differences between OTUs and assigning a species-level sequence threshold for all taxa, can re-
ASVs (¼ZOTUs) in metabarcoding of freshwater nematode com- quire substantial additional work. On the simpler end of the spec-
munities and suggested as an explanation the more stringent fil- trum, this includes building phylogenetic trees to confirm
tering in the pipeline prior to generating ASVs. In addition, many “species-level” clustering, and on the other end involves incorpo-
of the taxa found using only OTUs by Schenk et al. (2020) were rating sophisticated species delimitation methods (e.g. multi-rate
detected with a small percentage of sequence reads. For our data, Poisson tree processes, Kapli et al., 2017). Development of analyt-
another explanation relates to the pre-defined confidence thresh- ical tools that can be used to explore metazoan diversity, particu-
olds to assign a phylum. We noticed cases in which for the raw larly taking into account sequence variation across taxonomic
data a phylum was assigned for both OTUs and ZOTUs, but with groups, is needed.
different confidence thresholds, and only one of the clustering
schemes passed the pre-defined taxonomic thresholds. Singletons Conclusion
are commonly filtered out in metabarcoding studies given the If identification of different taxa is critical for a metabarcoding
likelihood that they represent artefacts, but some authors argue study, we recommend: (i) using multiple genetic markers,
that they could be important for the detection of rare species (ii) implementing multiple methods for a taxonomic assignment,
(Brown et al., 2015). Future research using mock communities (iii) clustering the data into different types of molecular opera-
could contribute to our understanding of these topics; in the tional units, and more importantly, (iv) collaborating with taxo-
meantime, we suggest implementing different clustering nomists to develop a regional database of the groups of interest,
approaches if taxon identification is important. especially if they are underrepresented in reference databases.
Recommended confidence thresholds for accurate taxonomic This multi-analysis approach can enhance taxon detection and
assignment depend on the completeness of the reference database increase confidence in the results. There are already many such
for the focal taxon (Porter and Hajibabaei, 2018). We followed collaborations underway that are collectively populating reference
recommended guidelines to assign taxa to phylum (e.g. Ransome databases and will improve the performance of taxonomic assign-
et al., 2017; Leray et al., 2018). However, there are no agreed- ment in biodiversity surveys using metabarcoding. In addition,
upon criteria at lower taxonomic levels. For example, for meta- new tools and pipelines are continuously being developed, many
barcoding of 18S using the CREST LCAClassifier algorithm, of them in open-access platforms. To improve taxonomic assign-
Lanzén et al. (2012) used minimum similarities between related ments at lower taxonomic levels (species, genus, family), we think
taxa and cross-validation from reference datasets to propose spe- that future research should aim to develop taxon-specific thresh-
cific thresholds for each taxonomic level (phylum: 80%, class: olds for different genetic markers, to account for the different
85%, order: 90%, family: 95%, genus: 97%, and species: 99%); evolutionary rates in different taxonomic groups.
whereas Leasi et al. (2018) used BLAST when analysing the 18S
V9 region and chose different thresholds based on the literature Supplementary data
and analysis of mock communities (phylum: 90%, family: 93%, Supplementary material is available at the ICESJMS online ver-
species: >97%). For metabarcoding of COI, Elbrecht et al. (2017) sion of the manuscript.
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
Taxonomic expertise improves metabarcoding 13
Funding analyses, and expanded StreaCode data including the sampling
This work was funded through a Smithsonian National Museum locations is available in the Dryad Digital Repository at DOI
of Natural History (NMNH) Associate Director for Science Core https://doi.org/10.5061/dryad.tdz08kpzx.
Grant (2017), with additional support from a Smithsonian
Global Genome Initiative grant (GGI-Rolling-2017-109a). We References
also wish to acknowledge funding and technical support from the Abad, D., Albaina, A., Aguirre, M., Laza-Martı́nez, A., Uriarte, I.,
Iriarte, A., Villate, F., et al. 2016. Is metabarcoding suitable for es-
Smithsonian Institution Barcode Network (SIBN, FY2017 Award
tuarine plankton monitoring? A comparative study with micros-
cycle). Katrina M. Pagenkopp Lohan was funded as a Robert and copy. Marine Biology, 163: 149.
Arlene Kogod Secretarial Scholar. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J.
1990. Basic local alignment search tool. Journal of Molecular
Acknowledgements Biology, 215: 403–410.
Genetic benchwork and sequencing was completed at the Aylagas, E., Borja, Á., Irigoien, X., and Rodrı́guez-Ezpeleta, N. 2016.
Smithsonian NMNH Laboratories of Analytical Biology (LAB). Benchmarking DNA metabarcoding for biodiversity-based moni-
We thank NMNH specialists Rebecca Dikow, Mike Trizna, and toring and assessment. Frontiers in Marine Science, 3: http://jour
Matthew Kweskin for their assistance with the computations us- nal.frontiersin.org/Article/10.3389/fmars.2016.00096/abstract (last
accessed 11 January 2020).
ing the Smithsonian High Performance Cluster (SI/HPC,
Baird, D. J., and Hajibabaei, M. 2012. Biomonitoring 2.0: a new para-
Smithsonian Institution, https://doi.org/10.25572/SIHPC), and
digm in ecosystem assessment made possible by next-generation
Amanda Devine for assistance using the GGI Gap Analysis Tool. DNA sequencing. Molecular Ecology, 21: 2039–2044.
We also thank Museum Specialist Abigail Reft for her assistance Bazinet, A. L., and Cummings, M. P. 2012. A comparative evaluation
with the cephalopods data, and Julia Steier for her helpful edito- of sequence classification programs. BMC Bioinformatics, 13: 92.
rial review on the final version. We thank the crew of the M/V Bhadury, P., Austen, M., Bilton, D., Lambshead, P., Rogers, A., and
Thunderforce for technical support during zooplankton collec- Smerdon, G. 2006. Development and evaluation of a
tions. Additionally, we are grateful to Smithsonian Marine DNA-barcoding approach for the rapid identification of nemato-
Station laboratory and field support staff, Sherry Reed, Woody des. Marine Ecology Progress Series, 320: 1–9.
Lee, David Branson, and Scott Jones for their assistance, to Brown, E. A., Chain, F. J. J., Crease, T. J., MacIsaac, H. J., and
NMNH administrative support Carol Youmans, Marisol Cristescu, M. E. 2015. Divergence thresholds and divergent biodi-
versity estimates: can metabarcoding reliably describe zooplank-
Arciniega-Melendez, and Joan Kaminski, and to Lisa Comer and
ton communities? Ecology and Evolution, 5: 2234–2251.
Mark Lehtonen for collections management support. This publi-
Bucklin, A., Lindeque, P. K., Rodriguez-Ezpeleta, N., Albaina, A., and
cation is Smithsonian Marine Station contribution no. 1156. Lehtiniemi, M. 2016. Metabarcoding of marine zooplankton:
prospects, progress and pitfalls. Journal of Plankton Research, 38:
Author’s contributions 393–400.
MJB and KJO conceived the StreamCode project idea, secured Callahan, B. J., McMurdie, P. J., and Holmes, S. P. 2017. Exact se-
the initial funding and led data collection in the field. KJO devel- quence variants should replace operational taxonomic units in
oped the workflows and record keeping for the fieldwork and se- marker-gene data analysis. The ISME Journal, 11: 2639–2643.
cured additional funding. All authors excluding AB, KM, NER, Caporaso, J. G., Kuczynski, J., and Knight, R. 2010. QIIME allows
PP, and KMLP, participated in the fieldwork and morphological analysis of high-throughput community sequencing data. Nature
Methods, 7: 335–336.
classification. AGC, KMH, KJO, MJB, MV, PP, and WJ, per-
Carew, M. E., Kellar, C. R., Pettigrove, V. J., and Hoffmann, A. A.
formed quality controls on refined IDs. SBT generated the 18S 2018. Can high-throughput sequencing detect macroinvertebrate
V1–2 barcodes, KM and NER generated the COI barcodes. MJB diversity for routine monitoring of an urban river? Ecological
processed all samples for metabarcoding. KMLP developed and Indicators, 85: 440–450.
implemented the bioinformatics pipeline, KMLP and PP worked Clarke, L. J., Beard, J. M., Swadling, K. M., and Deagle, B. E. 2017.
on taxonomic assignments for sequence data. AB contributed Effect of marker choice and thermal cycling protocol on zoo-
with the gap analysis and metadata of StreamCode barcodes. PP plankton DNA metabarcoding studies. Ecology and Evolution, 7:
designed and conducted data analyses and wrote the first draft. 873–883.
KJO, KMH, KMLP, and AGC, contributed critically to the data Cowart, D. A., Pinheiro, M., Mouchel, O., Maguer, M., Grall, J.,
Miné, J., and Arnaud-Haond, S. 2015. Metabarcoding is powerful
analysis and interpretation of results. AGC, EES, JAG, KJO,
yet still blind: a comparative analysis of morphological and mo-
KMH, KMLP, MJH, MJB, MV, NER, SLB, and WJ provided feed- lecular surveys of seagrass communities. PLoS One, 10: e0117562.
back on the manuscript. All authors approved the final Cristescu, M. E. 2014. From barcoding single individuals to metabar-
manuscript. coding biological communities: towards an integrative approach
to the study of global biodiversity. Trends in Ecology &
Data availability Evolution, 29: 566–571.
The StreamCode DNA barcode sequences underlying this article Deiner, K., Bik, H. M., Mächler, E., Seymour, M., Lacoursière-
are available in the GenBank Nucleotide Database at https://www. Roussel, A., Altermatt, F., Creer, S., et al. 2017. Environmental
ncbi.nlm.nih.gov/genbank/, and can be accessed with the acces- DNA metabarcoding: transforming how we survey animal and
plant communities. Molecular Ecology, 26: 5872–5895.
sion numbers provided in Supplementary Table S1; sequences
DeWalt, R. E. 2011. DNA barcoding: a taxonomic point of view.
are also associated to NCBI BioProject PRJNA421480. The
Journal of the North American Benthological Society, 30:
StreamCode metabarcoding raw data is available through the 174–181.
NCBI Sequence Read Archive and included in the NCBI Djurhuus, A., Pitz, K., Sawaya, N. A., Rojas-Márquez, J., Michaud, B.,
BioProject PRJNA421480. The raw data includes the V9 region of Montes, E., Muller-Karger, F., et al. 2018. Evaluation of marine
18S that was not part of this analysis. The R code used for the zooplankton community structure through environmental DNA
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021
14 P. Pappalardo et al.
metabarcoding: metabarcoding zooplankton from eDNA. Pagenkopp Lohan, K. M., Fleischer, R. C., Carney, K. J., Holzer, K.
Limnology and Oceanography: Methods, 16: 209–221. K., and Ruiz, G. M. 2016. Amplicon-based pyrosequencing reveals
Edgar, R. C. 2013. UPARSE: highly accurate OTU sequences from high diversity of protistan parasites in ships’ ballast water: impli-
microbial amplicon reads. Nature Methods, 10: 996–998. cations for biogeography and infectious diseases. Microbial
Edgar, R. C. 2016. UNOISE2: improved error-correction for Illumina Ecology, 71: 530–542.
16S and ITS amplicon sequencing. bioRxiv, 081257. https://www. Pappalardo, P., Pagenkopp Lohan, K. M., Boyle, M. J., Collins, A. G.,
semanticscholar.org/paper/UNOISE2%3A-improved-error-correc Hanson, K. M., Truskey, S. B., Jaekle, W., et al. 2020. Improving tax-
tion-for-Illumina-16S-Edgar/a9867e89687a6fb581e664c8880e743 onomic assignment of DNA metabarcoding with taxonomic exper-
8af8c8b5a tise. Ecological Society of America Annual Meeting, Aug 3-6. https://
Edgar, R. C. 2018. Updating the 97% identity threshold for 16S ribo- eco.confex.com/eco/2020/meetingapp.cgi/Paper/87812.
somal RNA OTUs. Bioinformatics, 34: 2371–2375. Parry, L. A., Edgecombe, G. D., Eibye-Jacobsen, D., and Vinther, J.
Elbrecht, V., Vamos, E. E., Meissner, K., Aroviita, J., and Leese, F. 2016. The impact of fossil data on annelid phylogeny inferred
2017. Assessing strengths and weaknesses of DNA metabarco- from discrete morphological characters. Proceedings of the Royal
ding-based macroinvertebrate identification for routine stream Society B. Biological Sciences, 283: 20161378.
monitoring. Methods in Ecology and Evolution, 8: 1265–1275. Pitz, K. J., Guo, J., Johnson, S. B., Campbell, T. L., Zhang, H.,
Guillou, L., Bachar, D., Audic, S., Bass, D., Berney, C., Bittner, L., Vrijenhoek, R. C., Chavez, F. P., et al. 2020. Zooplankton biogeo-
Boutte, C., et al. 2012. The Protist Ribosomal Reference database graphic boundaries in the California Current System as deter-
(PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA mined from metabarcoding. PLoS One, 15: e0235159.
sequences with curated taxonomy. Nucleic Acids Research, 41: Porter, T. M., and Hajibabaei, M. 2018. Scaling up: a guide to high-
D597–D604. throughput genomic approaches for biodiversity analysis.
Horton, T., Kroh, A., Ahyong, S., Bailly, N., Boyko, C. B., Brand~ao, S. N., Molecular Ecology, 27: 313–338.
Gofas, S., et al. 2020. World Register of Marine Species (WoRMS). Puillandre, N., Bouchet, P., Boisselier-Dubayle, M.-C., Brisset, J.,
WoRMS Editorial Board. https://www.marinespecies.org. Buge, B., Castelin, M., Chagnoux, S., et al. 2012. New taxonomy
Kapli, P., Lutteropp, S., Zhang, J., Kobert, K., Pavlidis, P., Stamatakis, and old collections: integrating DNA barcoding into the collection
A., and Flouri, T. 2017. Multi-rate Poisson tree processes for single-- curation process. Molecular Ecology Resources, 12: 396–402.
locus species delimitation under maximum likelihood and Markov Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P.,
chain Monte Carlo. Journal of Plankton Research, 41: 571–582. Peplies, J., et al. 2012. The SILVA ribosomal RNA gene database
Lanzén, A., Jørgensen, S. L., Huson, D. H., Gorfer, M., Grindhaug, S. project: improved data processing and web-based tools. Nucleic
H., Jonassen, I., Øvreås, L., et al. 2012. CREST – classification Acids Research, 41: D590–D596.
resources for environmental sequence tags. PLoS One, 7: e49334.
R Core Team. 2020. R: A Language and Environment for Statistical
Leasi, F., Sevigny, J. L., Laflamme, E. M., Artois, T., Curini-Galletti, Computing. R Foundation for Statistical Computing, Vienna,
M., de Jesus Navarrete, A., Di Domenico, M., et al. 2018. Austria. https://www.R-project.org/.
Biodiversity estimates and ecological interpretations of meiofau-
Ransome, E., Geller, J. B., Timmers, M., Leray, M., Mahardini, A.,
nal communities are biased by the taxonomic approach.
Sembiring, A., Collins, A. G., et al. 2017. The importance of stan-
Communications Biology, 1: 112.
dardization for biodiversity comparisons: a case study using au-
Lejzerowicz, F., Esling, P., Pillet, L., Wilding, T. A., Black, K. D., and
tonomous reef monitoring structures (ARMS) and
Pawlowski, J. 2015. High-throughput sequencing and morphol-
metabarcoding to measure cryptic diversity on Mo’orea coral
ogy perform equally well for benthic monitoring of marine
reefs, French Polynesia. PLoS One, 12: e0175066.
ecosystems. Scientific Reports, 5: 13932.
Ratnasingham, S., and Hebert, P. D. N. 2007. BOLD: the barcode of
Leray, M., and Knowlton, N. 2015. DNA barcoding and metabarcod-
life data system (http://www.barcodinglife.org). Molecular
ing of standardized samples reveal patterns of marine benthic di-
Ecology Notes, 7: 355–364.
versity. Proceedings of the National Academy of Sciences, 112:
2076–2081. Richardson, R. T., Bengtsson-Palme, J., and Johnson, R. M. 2017.
Evaluating and optimizing the performance of software com-
Leray, M., and Knowlton, N. 2016. Censusing marine eukaryotic di-
versity in the twenty-first century. Philosophical Transactions of monly used for the taxonomic classification of DNA metabarcod-
the Royal Society B: Biological Sciences, 371: 20150331. ing sequence data. Molecular Ecology Resources, 17: 760–769.
Leray, M., Ho, S.-L., Lin, I.-J., and Machida, R. J. 2018. MIDORI Santoferrara, L. F. 2019. Current practice in plankton metabarcoding:
server: a webserver for taxonomic assignment of unknown meta- optimization and error management. Journal of Plankton
zoan mitochondrial-encoded sequences using a curated database. Research, 41: 571–582.
Bioinformatics, 34: 3753–3754. Schenk, J., Kleinbölting, N., and Traunspurger, W. 2020. Comparison
Lindeque, P. K., Parry, H. E., Harmer, R. A., Somerfield, P. J., and of morphological, DNA barcoding, and metabarcoding character-
Atkinson, A. 2013. Next generation sequencing reveals the hidden izations of freshwater nematode communities. Ecology and
diversity of zooplankton assemblages. PLoS One, 8: e81327. Evolution, 10: 2885–2899.
Lobo, J., Shokralla, S., Costa, M. H., Hajibabaei, M., and Costa, F. O. Struck, T. H., Paul, C., Hill, N., Hartmann, S., Hösel, C., Kube, M.,
2017. DNA metabarcoding for high-throughput monitoring of Lieb, B., et al. 2011. Phylogenomic analyses unravel annelid evolu-
estuarine macrobenthic communities. Scientific Reports, 7: 15618. tion. Nature, 471: 95–98.
Machida, R. J., Leray, M., Ho, S.-L., and Knowlton, N. 2017. Tang, C. Q., Leasi, F., Obertegger, U., Kieneke, A., Barraclough, T.
Metazoan mitochondrial gene sequence reference datasets for tax- G., and Fontaneto, D. 2012. The widely used small subunit 18S
onomic assignment of environmental samples. Scientific Data, 4: rDNA molecule greatly underestimates true diversity in biodiver-
170027. sity surveys of the meiofauna. Proceedings of the National
Pagenkopp Lohan, K., Campbell, T. L., Guo, J., Wheelock, M., Academy of Sciences, 109: 16208–16212.
DiMaria, R. A., and Geller, J. B. 2019. Intact vs. homogenized Wang, Q., Garrity, G. M., Tiedje, J. M., and Cole, J. R. 2007. Naive
subsampling: testing impacts of pre-extraction processing of Bayesian classifier for rapid assignment of rRNA sequences into
multi-species samples on invasive species detection. Management the new bacterial taxonomy. Applied and Environmental
of Biological Invasions, 10: 324–341. Microbiology, 73: 5261–5267.
Handling editor: William Grant
Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021