ICES Journal of Marine Science (2021), doi:10.1093/icesjms/fsab082 Contribution to the Themed Section: ‘Patterns of Biodiversity of Marine Zooplankton Based on Molecular Analysis’ The role of taxonomic expertise in interpretation of metabarcoding studies Paula Pappalardo 1*, Allen G. Collins 1,2, Katrina M. Pagenkopp Lohan 3, Kate M. Hanson 4, Sarit B. Truskey 1,5, William Jaeckle 6, Cheryl Lewis Ames 1,7, Jessica A. Goodheart 1, Stephanie L. Bush 1,8,9, Leann M. Biancani 1, Ellen E. Strong 1, Michael Vecchione 1,2, M. G. Harasewych 1, Karen Reed 1, Chan Lin 1, Elise C. Hartil 10, Jessica Whelpley11, Jamie Blumberg 6, Kenan Matterson 12, Niamh E. Redmond 13, Allison Becker 13, Michael J. Boyle 14,†, and Karen J. Osborn 1,† 1Department of Invertebrate Zoology, Smithsonian National Museum of Natural History, Washington, DC, USA 2National Systematics Laboratory, NOAA’s National Marine Fisheries Service, Smithsonian National Museum of Natural History, Washington, DC, USA 3Marine Disease Ecology Laboratory, Smithsonian Environmental Research Center, Edgewater, MD, USA 4Biology Department, Duke University, Durham, NC, USA 5Marine Science Center, Northeastern University, Nahant, MA, USA 6Biology Department, Illinois Wesleyan University, Bloomington, IL, USA 7International Integrative Research and Instruction, Graduate School of Agricultural Science, Tohoku University, Sendai, Japan 8Monterey Bay Aquarium Research Institute, Moss Landing, Midwater Ecology Lab, CA, USA 9Monterey Bay Aquarium, Animal Husbandry Department, Monterey, CA, USA 10School of Marine Sciences, University of Maine, Orono, ME, USA 11Whitney Laboratory for Marine Bioscience, University of Florida, St. Augustine, FL, USA 12Department of Biological, Geological, and Environmental Sciences, University of Bologna, Ravenna, Italy 13Smithsonian Institution Barcode Network, Smithsonian National Museum of Natural History, Washington, DC, USA 14Smithsonian Marine Station, Smithsonian National Museum of Natural History, Fort Pierce, FL, USA *Corresponding author: e-mail: paulapappalardo@gmail.com. †The last two authors are co-senior authors. Pappalardo, P., Collins, A. G., Pagenkopp Lohan, K. M., Hanson, K. M., Truskey, S. B., Jaeckle, W., Ames, C. L., Goodheart, J. A., Bush, S. L., Biancani, L. M., Strong, E. E., Vecchione, M., Harasewych, M. G., Reed, K., Lin, C., Hartil, E. C., Whelpley, J., Blumberg, J., Matterson, K., Redmond, N. E., Becker, A., Boyle, M. J., and Osborn, K. J. The role of taxonomic expertise in interpretation of metabarcoding studies. – ICES Journal of Marine Science, doi:10.1093/icesjms/fsab082. Received 29 December 2020; revised 5 April 2021; accepted 5 April 2021. The performance of DNA metabarcoding approaches for characterizing biodiversity can be influenced by multiple factors. Here, we used mor- phological assessment of taxa in zooplankton samples to develop a large barcode database and to assess the congruence of taxonomic identi- fication with metabarcoding under different conditions. We analysed taxonomic assignment of metabarcoded samples using two genetic markers (COI, 18S V1–2), two types of clustering into molecular operational taxonomic units (OTUs, ZOTUs), and three methods for taxo- nomic assignment (RDP Classifier, BLASTn to GenBank, BLASTn to a local barcode database). The local database includes 1042 COI and 1108 18S (SSU) barcode sequences, and we added new high-quality sequences to GenBank for both markers, including 109 contributions at the VC International Council for the Exploration of the Sea 2021. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/ licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is prop- erly cited. Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 2 P. Pappalardo et al. species level. The number of phyla detected and the number of taxa identified to phylum varied between a genetic marker and among the three methods used for taxonomic assignments. Blasting the metabarcodes to the local database generated multiple unique contributions to identify OTUs and ZOTUs. We argue that a multi-marker approach combined with taxonomic expertise to develop a curated, vouchered, lo- cal barcode database increases taxon detection with metabarcoding, and its potential as a tool for zooplankton biodiversity surveys. Keywords: barcode, biodiversity, Gulf Stream, holoplankton, meroplankton, metabarcode, zooplankton Introduction Linnaean ranks (Guillou et al., 2012). The Midori database Recent studies suggest that metabarcoding could eventually re- includes sequences from GenBank with taxonomic information place traditional morphological methods for biodiversity surveys at the species level for 13 protein-coding genes (including COI) and ecosystem assessment (Lejzerowicz et al., 2015; Aylagas et al., and two ribosomal RNA (lrRNA and srRNA; Machida et al., 2016; Elbrecht et al., 2017; Lobo et al., 2017; Carew et al., 2018). 2017). The stringent filtering of GenBank data improves data reli- By using DNA-identification to analyse a mixture of unidentified ability in Midori, but greatly reduces the representation of known species, metabarcoding offers the potential to dramatically reduce species diversity (Machida et al., 2017). While not often quanti- the time and cost of biodiversity surveys while increasing detec- fied, similarly uneven and incomplete representation of extant tion of taxa that are difficult to identify based on morphology taxa should be expected for other curated databases. alone (e.g. invasive species, Abad et al., 2016; cryptic species, The identification at any level from population to phylum of Aylagas et al., 2016; parasites, Pagenkopp Lohan et al., 2016). metabarcoding sequence profiles depends not only on the com- However, in many environments, it remains to be demonstrated pleteness of the reference database but also on the algorithm that metabarcoding can perform better than traditional morpho- used. Following Bazinet and Cummings (2012), these algorithms logical assessments by taxonomic experts. The performance of can be classified as (i) similarity-based (e.g. programs using metabarcoding approaches is highly dependent on multiple fac- BLAST), (ii) composition-based (e.g. programs using Naive tors, including how the initial samples are processed (Ransome Bayes Classifier), and (iii) phylogeny-based (e.g. programs using et al., 2017; Pagenkopp Lohan et al., 2019), the choice of primers maximum likelihood or Bayesian methods). Even within a cate- (Lobo et al., 2017), PCR profiles (Aylagas et al., 2016; Clarke gory, the performance of different bioinformatics pipelines can et al., 2017), genetic markers (Pitz et al., 2020), the completeness vary widely when applied to the same dataset (Bazinet and of the DNA sequence reference database (Lindeque et al., 2013), Cummings, 2012; Pitz et al., 2020). In addition, the resolution and even the software used for taxonomic assignment (Bazinet and error rate of taxonomic assignments can vary for different ge- and Cummings, 2012). netic markers (Richardson et al., 2017; Pitz et al., 2020). Opinions Even though traditional biodiversity surveys based on mor- on which markers work best vary greatly even for the same taxo- phological assessments are time consuming and depend on the nomic group (Bhadury et al., 2006; Tang et al., 2012). In studies level of taxonomic expertise available, they may capture relevant targeting taxonomically diverse assemblages, there is agreement details for population and community analyses (such as sex, life that a multi-marker approach is more effective (Cowart et al., history stage, and relative abundance of taxa, Lindeque et al., 2015; Bucklin et al., 2016; Djurhuus et al., 2018), but continued 2013). Most importantly, taxonomic experts can contribute DNA assessment of different methods and markers for taxonomic as- sequences to reference databases based on confidently identified signment remains critical to enable metabarcoding data for stud- and well-documented specimens deposited in a vouchering insti- ies of biodiversity and ecological research. tution. When paired with metabarcoding, this traditional ap- A more fundamental methodological choice in any metabar- proach can dramatically reduce the number of unidentified (or coding study is to determine how to generate the final set of misidentified) sequences (Ransome et al., 2017), which in many sequences for taxonomic identification. Sequences are usually studies, comprise a large part of a community sample (e.g. 93% clustered into operational taxonomic units (OTUs), using a de- of taxa for the COI marker in Cowart et al., 2015). These uniden- fined similarity threshold (typically 97%) that is likely to repre- tified sequences may actually represent species that are new to sci- sent different species. But recent studies have suggested that ence; however, most often they represent known taxa that are not individual unique amplicons (sequences amplified from an envi- represented in current reference DNA-sequence databases. ronmental DNA sample) make metabarcoding more reproducible Therefore, experts agree that continued contribution of vouch- and better represent the full biological diversity within environ- ered sequences to existing reference databases will be a major step mental samples (Callahan et al., 2017; Edgar, 2018). These unique to fulfil the promise of metabarcoding approaches (Cristescu, sequences have been called amplicon sequence variants (ASVs) or 2014; Cowart et al., 2015; Bucklin et al., 2016; Ransome et al., zero-radius OTUs (ZOTUs) and capture both inter-specific and 2017; Porter and Hajibabaei, 2018). intra-specific variability. While the full impact of this choice on Several curated databases have been developed in parallel to derived research questions is still being assessed, it affects the the implementation of metabarcoding methods, emphasizing spe- number of taxa detected (Schenk et al., 2020). cific taxonomic groups or genetic markers (e.g. BOLD, Our study combined expert taxonomic knowledge with stan- Ratnasingham and Hebert, 2007; SILVA, Quast et al., 2012; see dard DNA barcoding and examines how metabarcoding results Table 2 of Porter and Hajibabaei, 2018 for a summary of data- can be improved by using taxonomic expertise to develop a cu- bases commonly used in metabarcoding studies). PR2 and rated database of vouchered DNA barcodes for multiple genetic Midori are additional curated databases that were not assessed in markers. We focused on a challenging yet commonly sampled as- Porter and Hajibabaei (2018). The PR2 database includes rRNA semblage, marine zooplankton communities. Given their broad sequences with a curated taxonomy structured to conform with taxonomic diversity, their roles as critical members of ocean food Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 3 webs, their many morphologically distinctive early life stages, and MA) for extraction, PCR amplification and Sanger sequencing the extensive taxonomic expertise required for accurate morpho- (see DNA Barcoding below). All morphological and image logical identifications, marine invertebrate zooplankton are a nat- vouchers generated by the StreamCode project (Supplementary ural target for diversity surveys using a metabarcoding approach. Table S1), as well as tissues and DNA aliquots, are available in the Specifically, we compared the resulting taxonomic identification collections of the National Museum of Natural History of metabarcoding samples analysed using: (i) different markers, (NMNH). i.e., either COI or 18S V1–2, (ii) different bioinformatic pipelines, After each of the samples was sequenced for COI and 18S V1– and (iii) different operational taxonomic units (OTUs and 2 regions, a two-pronged approach was taken to verify and refine ZOTUs). The different methods were compared among each the initial identifications: (i) BLAST and alignment based and (ii) other, since the true community composition is unknown. To ac- phylogeny based. Both approaches were run on the Smithsonian complish these goals, the StreamCode Project (a collaborative ef- Institution’s High Performance Computing Cluster (https://doi. fort led by researchers in the Smithsonian National Museum of org/10.25572/SIHPC), see Supplementary material for additional Natural History), made multiple collections of plankton from a details. For the BLAST approach (Altschul et al., 1990), each se- specific geographic area and developed one of the largest curated quence was compared against sequences in GenBank and personal zooplankton databases that includes DNA barcodes. We analyse reference libraries to verify taxon assignments. For the phylogeny- if this regional database can help to identify organisms to higher based approach, we combined target sequences, related sequences taxonomic levels such as phylum, and also if it can increase the from GenBank, and sequences from personal libraries to build taxonomic resolution necessary to classify zooplankton into eco- phylogenetic trees and to determine the placement of each target logical plankton type to identify permanent (holoplankton) or sequence in relation to known sequences. Based on the initial field temporary (meroplankton) residents of the plankton. IDs, results of the BLASTn searches, the phylogenetic trees, photo- graphs, and voucher specimens, taxonomic experts assigned a final Methods ID to each sample (Supplementary Table S1). StreamCode project plankton collection Twelve plankton tow samples were collected from the Florida DNA barcoding Current of the Gulf Stream off the Atlantic coast near Fort Pierce, Detailed sequencing methods, primers used (Supplementary Florida (27.45N 79.95W) in June and August 2017 between the Table S2), and quality control protocols are provided in the surface and 77–145 m depth. To maximize the diversity of taxa Supplementary material. DNA extraction, amplification, and se- recovered, we used both a circular mouth plankton net (mesh quencing were performed in the Smithsonian Laboratories of size 209mm) and a modified mid-water trawl (1.0 m2 square Analytical Biology (LAB), NMNH. All 18S V1–2 and COI sequen- mouth net with 500mm mesh). Upon collection and mixing, ap- ces produced in the current study, the StreamCode DNA Barcode proximately one-quarter of each plankton sample was separated Database, were uploaded to GenBank (https://www.ncbi.nlm.nih. for metabarcoding analysis. Each tow-specific subsample desig- gov/genbank/, NCBI BioProject PRJNA421480, see Supplementary nated for the metabarcoding analysis was concentrated by pour- Table S1 for accession numbers). ing it through a 50mm mesh nylon filter, rinsed with ultracold 95% ethanol into a 50 ml polypropylene tube, stored on dry ice StreamCode DNA barcode database for transport to the Smithsonian Marine Station at Fort Pierce To identify novel contributions of the StreamCode DNA barcode (SMSFP), and stored at 80C until processing. The leftover por- database to GenBank, we used the gap analysis tool developed by tion of the mixed tow sample was diluted in at least 10 l of seawa- the Global Genome Initiative (GGI) for each refined ID ter, kept aerated and chilled during transport to SMSFP, and used (Supplementary Figure S1) at the rank of species and higher taxo- for live sorting of individuals. nomic levels (GGI Gap Analysis Tool, 2019; https://www.global geno.me/gaps/live), GGI tool last accessed on 29 October 2020. Taxonomic identification using morphology The live gap analysis tool searches GenBank for each name con- Live plankton were hand-sorted based on taxonomic groups un- sidering only sequences of barcode quality, defined as GenBank der stereomicroscopes at the SMSFP by a team of taxonomic sequence records for COI, rbcL, matK, and/or ITS with a se- experts and students. Animals were selectively picked with the quence length > 500 bp that include information identifying a goal of maximizing the diversity of samples collected from focal scientific name, a voucher specimen, and the amplification groups across 15 invertebrate phyla (between 1 and 200 individu- primer sequences (GGI Gap Analysis Tool, 2019). From the two als per focal group, Table 1). For some groups (pteropods, holo- markers we used for this study (COI and 18S), only COI is offi- planktonic polychaetes), selective sampling was able to capture cially considered a barcode and included in the Gap Analysis most of the expected species diversity. For species-rich groups Tool; however, we modified this tool with an in-house Python such as copepods, the sorting effort captured only a fraction of script to search for 18S sequences in GenBank separately for each the expected species diversity, and experts targeted taxa not yet Phylum. represented in GenBank. Live specimens were grouped by mor- photype and classified to the lowest taxonomic level possible. DNA metabarcoding and bioinformatics protocol Individuals and populations of live specimens were photo- DNA extraction, PCR amplification, and multiplex library prepara- graphed in the laboratory at SMSFP (see photography details and tion of pooled amplicons were performed at the SMSFP; final how to access images in the Supplementary material). The high- quantification and library-size determination steps for sequencing resolution images allowed us to verify the initial identifications protocols were performed by staff at LAB. Raw metabarcoding made in the field. Tissue samples or whole animals were placed data is available through NCBI Sequence Read Archive and can be directly in 150ml TD-M2 tissue buffer (Autogen, Inc., Hollister, found in NCBI BioProject PRJNA421480 (https://www.ncbi.nlm. Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 4 P. Pappalardo et al. Table 1. Barcoding success. Number of Total number COI 18S % success % success % success Phylum Target group morphospecies of individuals barcodes barcodes Individuals barcoded for both COI 18S both Annelida Class Polychaeta 57 160 107 115 93 66.88 71.88 58.13 Arthropoda Subclass Copepoda 51 95 77 95 77 81.05 100 81.05 Suborder Hyperiidea 60 202 177 132 122 87.62 65.35 60.4 Miscelaneous arthropods 78 130 108 116 103 83.08 89.23 79.23 Class Ostracoda 18 91 53 70 41 58.24 76.92 45.05 Brachiopoda Phylum Brachiopoda 2 12 12 12 12 100 100 100 Bryozoa Phylum Bryozoa 3 7 6 7 6 85.71 100 85.71 Chaetognatha Phylum Chaetognatha 10 43 19 12 8 44.19 27.91 18.6 Chordata Class Appendicularia 7 24 5 16 4 20.83 66.67 16.67 Chordata Class Thaliacea 7 17 1 15 1 5.88 88.24 5.88 Cnidaria Class Anthozoa 8 27 20 20 17 74.07 74.07 62.96 Class Hydrozoa 38 103 58 86 57 56.31 83.5 55.34 Class Scyphozoa 4 19 17 17 17 89.47 89.47 89.47 Ctenophora Phylum Ctenophora 3 5 0 5 0 0 100 0 Echinodermata Class Asteroidea 6 30 28 27 25 93.33 90 83.33 Class Echinoidea 5 7 7 7 7 100 100 100 Class Holothuroidea 5 13 12 13 12 92.31 100 92.31 Class Ophiuroidea 11 26 17 23 17 65.38 88.46 65.38 Hemichordata Phylum Hemichordata 6 16 8 6 6 50 37.5 37.5 Mollusca Class Bivalvia 5 10 8 10 8 80 100 80 Class Cephalopoda 10 20 17 15 15 85 75 75 Class Gastropoda 51 96 76 73 66 79.17 76.04 68.75 Order Pteropoda 33 209 177 181 163 84.69 86.6 77.99 Nemertea Phylum Nemertea 1 1 0 0 0 0 0 0 Phoronida Phylum Phoronida 1 1 1 1 1 100 100 100 Sipuncula Phylum Sipuncula 19 35 31 34 31 88.57 97.14 88.57 Total 499 1399 1042 1108 Total number of morphospecies, total number of StreamCode samples collected for each taxonomic group, details of number of samples for each genetic marker that yielded useful sequences, and percentage of sam- ples that were successfully sequenced for COI and 18S V1–2. Cells in red highlight the groups with less than 75% success. Miscellaneous arthropods include non-hyperiid amphipods, and members of the orders Cumacea, Decapoda, Euphausiacea, Mysida, and Stomatopoda. Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 5 Table 2. Performance of methods to assign barcodes to phylum using the StreamCode DNA barcode database. RDP classifier BLASTn to GenBank BLASTn to StreamCode Marker StreamCode Phylum N barcodes % identified % wrong % identified % wrong % identified % wrong COI Annelida 107 6.5 14.3 22.4 0 83.2 0 Arthropoda 415 78.8 0 74.2 0 90.1 0 Brachiopoda 12 NA NA NA NA 100 0 Bryozoa 6 NA NA NA NA 100 0 Chaetognatha 19 89.5 5.9 84.2 0 84.2 0 Chordata 6 NA NA NA NA 66.7 0 Cnidaria 95 74.7 0 72.6 0 86.3 0 Echinodermata 64 89.1 0 64.1 0 84.4 0 Hemichordata 8 12.5 100 NA NA 87.5 0 Mollusca 278 75.2 0 77 0 87.8 0 Phoronida 1 NA NA NA NA NA NA Sipuncula 31 35.5 0 74.2 0 77.4 0 18S V1–2 Annelida 115 88.7 8.8 99.1 1.8 100 0 Arthropoda 413 87.2 0 96.9 0 99.5 0 Brachiopoda 12 100 0 100 0 100 0 Bryozoa 7 100 0 100 0 100 0 Chaetognatha 12 91.7 0 100 0 100 0 Chordata 31 100 0 100 0 100 0 Cnidaria 123 94.3 0 100 0 100 2.4 Ctenophora 5 100 0 100 0 100 0 Echinodermata 70 87.1 0 100 0 100 0 Hemichordata 6 100 0 100 0 100 0 Mollusca 279 95 0 99.6 0 99.3 0 Phoronida 1 100 0 100 0 100 100 Sipuncula 34 88.2 0 100 0 97.1 0 Total number of barcodes tested, percentage of barcodes identified to phylum, and percentage of barcodes with a wrong assignment for each marker and StreamCode phylum. To accept an identification to phylum, the confidence threshold was equal to or larger than 0.90 with RDP Classifier, and percent similar- ity was equal to or higher than 85% for BLASTn. The RDP Classifier was used with the Midori database for COI and with the PR2 database for 18S. NA (not ap- plicable) are instances where no assignment passed the confidence threshold. nih.gov/bioproject/?term¼PRJNA421480). Additional details on using BLASTn against the StreamCode DNA barcode database primers (Supplementary Table S3), sequencing methods and bioin- (we refer to this approach as BLASTn-StreamCode). formatics pipeline are provided in the Supplementary material. The RDP classifier (Wang et al., 2007) was implemented in Reads were merged and filtered with USEARCH 10.0 (Edgar, QIIME (Caporaso et al., 2010) for the taxonomic assignment of 2013). Post-merging, the allowable sequence range was 400– the 18S sequences, selecting PR2 11.1 as the reference database 500 bp for 18S V1–2 and 300–400 bp for COI. Primer specific (Guillou et al., 2012, PR2 database includes rRNA sequences with barcodes were demultiplexed in QIIME (Caporaso et al., 2010). curated taxonomy). For the COI sequences, the RDP Classifier In USEARCH, primers were removed, and then unique sequences was implemented in the Midori server queried on 3 October 2020 were identified, de-replicated, and sorted by abundance. Because (http://reference-midori.info/server.php; Leray et al., 2018), using the “species level” clustering threshold likely varies by taxonomic the Midori 2-LONGEST database as the reference (Machida et al., group, we took two clustering approaches: (i) clustering sequen- 2017; filtered database of mitochondrial-encoded genes for meta- ces into operational taxonomic units (OTUs) at 97% similarity zoans). We accepted a phylum classification as correct when the after removing singletons and (ii) clustering sequences into zero- confidence threshold was equal to or larger than 0.90. Based on radius operational taxonomic units (ZOTUs), allowing at least the information available in Midori, we observed that using 0.90 four copies, which is recommended for smaller datasets (Edgar, as the confidence threshold for COI yields an error rate less than 2016). The number of metabarcode sequences in each filtering 1% for most phyla commonly found in zooplankton assemblages step is detailed in Supplementary Table S4. The variability in the (http://reference-midori.info/download.php#; follow the path number of OTUs per phylum with an alternative OTU filtering Archive/Leave-one-sequence-out_test_1.1/COI). We are not (keeping only sequences with at least four copies) is presented in aware of similar specific guidelines for 18S using the PR2 data- Supplementary Table S5. base, so we applied the same confidence threshold to 18S. For the BLASTn-GenBank approach, we assigned taxonomy to Assigning taxonomy to the metabarcode sequences the OTUs and ZOTUs by running a BLASTn search with default We took three approaches to assign taxonomic identities to options on 6th October 2020. We filtered out matches lacking OTUs and ZOTUs: (i) taxonomic assignment using the RDP specific taxonomic information using custom scripts in R (R Classifier (Wang et al., 2007) with Midori or PR2 as reference Core Team, 2020, see Data Availability for code). For the databases for COI and 18S, respectively; (ii) taxonomic assign- BLASTn-StreamCode approach, we created a reference database ment using BLASTn against the NCBI nt database (we refer to with the StreamCode voucher-based barcodes for each marker this approach as BLASTn-GenBank); (iii) taxonomic assignment and assigned taxonomy using BLASTn. A match to phylum was Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 6 P. Pappalardo et al. accepted as correct if the percent similarity and sequence coverage resulting from the use of different genetic markers (COI, 18S were equal to or larger than 85% (Ransome et al., 2017). At least V1–2), different types of clustering (OTUs, ZOTUs), and differ- for COI, Ransome et al. (2017) showed that an 85% similarity ent approaches to assigning taxonomy (RDP Classifier with threshold provides a relatively small error rate (0.7%) when MIDORI/PR2, BLASTn-GenBank, and BLASTn-StreamCode). assigning sequences to phylum. We calculated the number (and proportion) of OTUs/ZOTUs for At the beginning of this project, we intended to compare each phylum, genetic marker, and approach for a taxonomic agreements at various taxonomic levels, but we did not find stan- assignment. dardized recommendations. Initial explorations with our dataset showed that we would need to generate specific thresholds of tax- Agreements and unique contributions among methods onomic assignments for each marker, method, and taxonomic for assignment to phylum group. For this reason, we focused our main analysis on the For all the OTUs/ZOTUs, we quantified the agreement between assignments to phylum. methods for taxonomic assignment to phylum. We recorded whether all methods agreed, only two agreed, or zero agreed. Taxonomic matching between databases Additionally, we kept records of how many agreements were to a To standardize the taxonomic names across the morphological specific phylum (e.g. Annelida) versus agreements in failure to assignments and the different databases consulted in the metabar- identify a phylum (e.g. Unidentified). coding analysis (PR2, MIDORI, and GenBank), we followed the For the taxonomic groups best studied by the StreamCode tax- classification in the World Register of Marine Species (WoRMS, onomic experts (Annelida, Arthropoda, Cnidaria, Mollusca, and Horton et al., 2020). All names corresponding to marine organ- Sipuncula), we combined the barcodes and OTUs into a phyloge- isms were matched using the WoRMS online taxon match tool ¼ netic tree (details in Supplementary material). For the metabarco-(http://www.marinespecies.org/aphia.php?p match), last des, we highlighted the cases in which two or three methods accessed in October 2020. For matches to non-marine organisms, agreed in the assignment and unique cases in which only one we followed the NCBI taxonomy database (https://www.ncbi. method was able to assign the sequence to a phylum (we refer to nlm.nih.gov/taxonomy), last accessed in October 2020. When a these cases as unique contributions). phylum assignment of a sequence did not pass the corresponding thresholds for each method, the phylum was labelled “Unidentified.” Plankton types Standardizing names across databases is important because the To investigate a categorical approach where it is necessary to taxonomic differences between databases are not minor. To give match taxa with an ecological planktonic type, we classified one example, the PR2 database considers Urochordata to be a OTUs/ZOTUs into meroplankton and holoplankton. Meroplankton phylum, but the WoRMS classification considers Urochordata to consists of the life-history stages of benthic organisms that live be subphylum Tunicata (within the Phylum Chordata). There is temporarily in the water column (e.g. eggs, larvae, and adults in some debate about the placement of Sipuncula as a subgroup of the case of medusozoan cnidarians), whereas holoplankton refers the phylum Annelida (Struck et al., 2011; Parry et al., 2016); to be to organisms that live permanently within the water column. consistent in our analysis, we followed the WoRMS classification, Although in some cases all members of a phylum belong to the as of November 2020, that considers Sipuncula a phylum. In ad- same plankton type (e.g. all Sipuncula were meroplankton), for dition, there were a few cases in which the minimum taxonomic Annelida, Arthropoda, Chordata, Cnidaria, and Mollusca, we used level reported in the output (by MIDORI or PR2) did not match customized taxonomic filters that involved taxonomic assignments the phylum assigned by WoRMS. When that happened, we ran below the level of phylum (i.e. Class, Order, Family). In this analy- independent BLASTn searches of each OTU/ZOTU and verified sis, we excluded OTUs/ZOTUs assigned as “Unidentified” or as that the WoRMS assignment was correct. We enumerate those non-target phyla. cases in the R code provided in the Dryad data package. Results Validating methods for a taxonomic assignment using StreamCode project the StreamCode DNA barcode database The collection, handling, morphological identification, and pho- To validate the performance of the taxonomic assignment to phy- tographing of zooplankton samples involved approximately lum for the metabarcoding data, we ran the StreamCode COI and 1800 h of field work by 31 persons, or 58 person-hours. Of the 18S barcodes using the same three methods for a taxonomic as- 2260 specimens collected, representative individuals were cata- signment detailed for the metabarcoding data: (i) RDP Classifier, logued (n¼ 1529; Supplementary Table S1), tissue sampled (ii) BLASTn-GenBank, and (iii) BLASTn-StreamCode, without (n¼ 1399), photographed (n¼ 1149, Figure 1), and, when possi- including the query sequence. As not all the barcodes were suc- ble, a morphological voucher was deposited into the National cessfully identified at the quality thresholds specified, we calcu- Museum of Natural History collections (n¼ 403). Universal pri- lated the percentage of barcodes that were identified, and from mers were effective for sequencing individuals of most taxonomic the barcodes identified, we calculated the percentage of wrong groups, although there was considerable variation in success be- assignments to phylum. tween the two genetic markers (COI, 18S V1–2) across the differ- ent target groups (Table 1). The number of morphospecies Variability in the metabarcoding results identified by the taxonomic experts for each target taxonomic To examine the variability of taxonomic assignments associated group is reported in Table 1. The refined ID for each sample, cur- with methodological choices when assessing a zooplankton as- rent classifications following the WoRMS taxonomy, voucher semblage, we compared sequence assignments to the phylum level identification number (USNM), and GenBank accession numbers Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 7 and 18S within the phyla Echinodermata, Mollusca, and Sipuncula (these could be related to taxonomic backbone differ- ences, see Supplementary Figure S1); and for COI within the phyla Annelida and Chordata (Supplementary Figure S1). We also contributed COI sequences for 33 families (from 6 phyla) previously not represented, and 11 for 18S; sequences were con- tributed for 66 genera (from 8 phyla) previously not represented for COI, and 33 for 18S; and sequences were contributed for 63 species previously not represented for COI, and 46 for 18S (Supplementary Figure S1). Overall, the largest number of se- quence contributions were for the phyla Arthropoda, Cnidaria, and Mollusca. Validating taxonomic assignment methods using the StreamCode DNA barcode database The percentage of StreamCode barcodes identified to phylum ranged from 0 to 100%, depending on the genetic marker and the method used to identify taxa (Table 2). For both markers and most phyla, the BLASTn-GenBank method identified a larger per- centage of barcodes than the RDP Classifier, and in most cases at a smaller or zero error rate (Table 2). Using the local database (BLASTn-StreamCode) improved the percentage of identified taxa for most phyla, with the largest improvement observed with COI for the phylum Annelida (Table 2). Not all the barcodes identified to a phylum were assigned to the correct phylum; the percentage of wrong assignments varied with marker and method but was in general larger for COI than for 18S. All of the StreamCode phyla were identified using the 18S barcodes, with only small errors in the assignment to phylum for Annelida and Cnidaria (Table 2), and a failure to correctly identify Phoronida using the BLASTn-StreamCode approach. Since there was only one StreamCode DNA barcode available for the phylum Figure 1. Examples of StreamCode specimens. (a) Pelagobia sp. Phoronida, when that sequence was queried there were no other (Polychaeta), USNM 1450035; (b) Maupasia sp. (Polychaeta), USNM 1450037; (c) Peachia sp. (Cnidaria), USNM 1448832; (d) Cytaeis sp. sequences from the phylum Phoronida for comparison. (Cnidaria), USNM 1447971; (e) Phronima sp. (Arthropoda), USNM 1450286; (f) Clio recurva (Mollusca), USNM 1448342; (g) Euchirella Variability in metabarcoding results curticauda (Arthropoda), USNM 1448593; (h) Otoporpa sp. The total number of OTUs/ZOTUs identified to phylum differed (Cnidaria), USNM 1448490; (i) Copilia sp. (Arthropoda), USNM by marker, and methods used for a taxonomic assignment 1448598; (j) Abralia veranyi (Mollusca), USNM 1447996. (a–d) (Table 3). Regardless of the type of clustering or method for a Highlight of four StreamCode samples that represent new COI contributions of barcode quality at the genus level not previously taxonomic assignment, the proportion of unidentified taxa was represented in GenBank. smaller for 18S V1–2 than COI (18S: 15.3–40.8%; COI: 34.7– 75.0%). In general, the BLASTn-GenBank method returned a smaller number of unidentified taxa and identified a larger num- for the samples sequenced successfully are also included in ber of non-target phyla when compared with the other methods Supplementary Table S1. (Table 3). Phylum detection at the defined taxonomic thresholds varied Contributions to GenBank by marker, type of clustering, and method for a taxonomic as- The gap analysis allowed us to identify taxa for which signment. The phyla detected by all metabarcoding method com- StreamCode contributed new COI or 18S sequences to GenBank binations (and by the morphological analysis) were Annelida, as of 29 October 2020 (taxon names are detailed in Arthropoda, Bryozoa, Chaetognatha, Chordata, Cnidaria, Supplementary Figure S1). The COI results were specific to con- Echinodermata, Mollusca, and Sipuncula. Other phyla were tributions of barcode quality (e.g. Flaccisagitta enflata will count detected only by some combinations: Acanthocephala using 18S as a new contribution because the available sequence in GenBank V1–2; Porifera using COI with BLASTn-GenBank and by mor- is shorter than 500 bp), but the 18S results represent any contri- phological analysis; Nematoda using OTUs by BLAST-GenBank bution for that genetic marker without a predefined quality filter. and RDP Classifier. By definition, the phyla not represented in For COI, we contributed sequences for three classes that were the StreamCode DNA barcode database (Acanthocephala, previously not included in GenBank (Appendicularia and Nematoda, Nemertea, and Platyhelminthes) were unidentified us- Thaliacea within phylum Chordata, and Phascolosomatidea ing the BLASTn-StreamCode approach. within phylum Sipuncula). No new class-level contributions were The cases in which the only one of the methods used for a tax- made for 18S. Order level contributions were made for both COI onomic assignment was able to identify a phylum (unique Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 8 P. Pappalardo et al. Table 3. Summary of metabarcoding results. Marker Clustering Taxa RDP classifier BLASTn-GenBank BLASTn-StreamCode COI OTUs Zooplankton 1661 1986 1234 Unidentified 3283 2817 3712 Non-target 2 143 0 ZOTUs Zooplankton 5168 5910 4646 Unidentified 4080 3206 4603 Non-target 1 133 0 18S V1–2 OTUs Zooplankton 1892 1914 2935 Unidentified 1468 1453 663 Non-target 238 231 0 ZOTUs Zooplankton 2252 2570 2744 Unidentified 781 526 707 Non-target 418 355 0 Number of OTUs/ZOTUs identified to target zooplankton phylum, unidentified to phylum, or identified to non-target phylum for each genetic marker, type of clustering, and method used for taxonomic assignment. Figure 2. Unique OTUs contributions from each method for a taxonomic assignment (BLASTn-GenBank, BLASTn-StreamCode, RDP classifier), for (a) COI and (b) 18S V1–2. The RDP Classifier was used with the PR2 database for 18S and MIDORI 2 database for COI. Upper panels: counts for all the unique contributions to identify OTUs in each phylum. We added a constant of 0.5 when the number of unique contributions was 1, to be able to represent those values in the logarithmic scale. “Non-target” refers to taxa identified to phyla that do not belong to the target zooplankton groups. Lower panels: distance tree of all the barcodes and OTUs for Annelida and Arthropoda, coloured according to the identification method for each genetic marker sequence. When two or three metabarcoding methods agreed in the assignment to the phylum, we coded it “Agreement.” Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 9 contributions) varied by taxonomic group and genetic marker type, the second most diverse groups were chaetognaths, chor- (Figure 2). For COI OTUs, the BLASTn-StreamCode method dates, cnidarians, mollusks, or non-target phyla. (Supplementary generated more unique contributions for Annelida, Bryozoa, Table S6, Figure 3b). Chaetognatha, and Echinodermata (Figure 2a); whereas for 18S Combining all OTUs and ZOTUs to analyse agreement among V1–2 OTUs, the BLASTn-StreamCode method generated the methods, we observed that the three methods agreed on the phy- largest number of unique contributions for taxa within lum assignment in 55% of sequences for 18S and 68% of sequen- Arthropoda (Figure 2b). BLASTn-GenBank and RDP Classifier ces for COI. Agreements between the two methods were observed had more unique contributions within Chordata, Arthropoda in 42% of sequences for 18S and 32% of sequences for COI (for COI) and non-target phyla; BLASTn-GenBank and BLASTn- (Figure 4). The case of no agreement between methods was rare StreamCode had the largest number of unique contributions for (0.1–3%). However, in many cases, the methods agreed only in the phylum Cnidaria; and RDP Classifier (for COI) performed their inability to identify a specimen to phylum, e.g., in 53% of best to identify taxa from the phylum Platyhelminthes (Figure 2). the cases all three methods were unable to identify a phylum for A distance tree including StreamCode barcodes and OTUs for COI at the specified thresholds (Figure 4). When all three meth- taxa assigned to Arthropoda and Annelida is presented in ods assigned a specific phylum, 98.9% of 18S V1–2 sequences Figure 2 (lower panel), similar trees for both markers and other agreed on phylum and 99.8% of COI sequences had matching phyla are presented in Supplementary Figure S3. The unique con- phylum assignments. tributions to identify ZOTUs were fairly similar and are presented in Supplementary Figure S2. Plankton types Arthropods (predominantly crustaceans) were the most di- After removing unidentified matches or matches to non-target verse group in our plankton samples for all markers and cluster- phyla, we found that most metabarcoding samples consisted of ing schemes (Figure 3a). Depending on the marker and clustering holoplankton, in agreement with morphological analyses Figure 3. Variability in the proportion of OTUs/ZOTUs belonging to each phylum identified using different genetic markers (COI and 18S V1–2) and different approaches for taxonomic assignment (BLASTn-GenBank, BLASTn-StreamCode, RDP Classifier). (a) All samples (unidentified samples and samples identified to phylum). (b) Only samples identified to phylum without including the phylum Arthropoda. The RDP Classifier was used with the PR2 database for 18S and MIDORI database for COI. “Non-target” refers to taxa identified to phyla that do not belong to the target zooplankton groups, “Unidentified” are the OTUs/ZOTUs without a phylum assignment at the defined taxonomic thresholds. The more abundant taxa are highlighted with black silhouettes. Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 10 P. Pappalardo et al. errors in taxonomic assignment (Santoferrara, 2019). To mini- mize errors and improve taxonomic resolution, the implementa- tion of metabarcoding methods and compilation of reference databases should be associated with taxonomic experts (Clarke et al., 2017; Porter and Hajibabaei, 2018). By integrating tradi- tional taxonomic assessment, barcoding, and metabarcoding, we have highlighted how a targeted effort to develop a regional DNA barcode database can help improve metabarcoding taxonomic assignments. Moreover, we showed that the chances of correctly identifying taxa belonging to different marine zooplankton phyla varied by genetic marker, the type of clustering, and the approach used for taxonomic assignment. During the short timeframe of the StreamCode project (one field season, two collection weeks), a small collaboration of taxo- nomic experts made a large contribution of barcode quality sequences to GenBank (1108 for 18S and 1042 for COI belonging to 15 phyla) as well as images and voucher information (Supplementary Table S1). Hopefully, this encourages similar work that targets taxa underrepresented in reference databases. Local bioblitz efforts (Plumb, 2014, where there is intense biodi- versity sampling in a small region carried out by experts and vol- unteers) coupled with barcoding efforts are becoming popular because they are essential for enhancing reference databases. Large international efforts such as the Marine Barcode of Life Figure 4. Frequency of agreement in the identification to phylum project have contributed voucher information and barcode between the different methods for taxonomic assignment for each sequences to the Barcode Of Life (BOLD) system (Puillandre of the two genetic markers (18S V1–2 and COI). Colours separate et al., 2012). Barcoding efforts led by non-taxonomists would the cases in which all methods agree, two methods agree, and zero benefit from including taxonomic experts at the beginning of the methods agree. For the cases in which two or three methods agree, project, incorporating their expertise for sampling design and col- the percentage of agreement in the failure to identify a phylum is lection of samples (DeWalt, 2011), and also for subsequent qual- presented within the bar. ity control checks and determination of unidentified taxa. Both targeted efforts and large collaborations are critical to improving (Figure 5). Although found at a comparatively smaller propor- the quality and taxonomic coverage of reference DNA databases, tion, a meroplankton component was detected across all plankton which in turn will make metabarcoding a more efficient tool for samples, with a larger proportion detected using morphology and characterizing communities captured in environmental samples. the BLASTn-StreamCode method in the metabarcoding samples. Developing local DNA sequence libraries helps to overcome For this analysis, we used the classification to lower taxonomic problems associated with incomplete publically available DNA levels when available; however, many of the metabarcoding sam- reference databases (and with incorrect entries), one of the cur- ples lacked enough resolution to classify taxa into holoplankton rent limitations of resolving taxonomic assignment with metabar- or meroplankton (coded N/A in Figure 5). For each combination coding (Porter and Hajibabaei, 2018). Every new sequence counts of clustering and genetic marker, the BLASTn-StreamCode because the addition of even a few species to a local reference method produced the smallest number of taxa unassigned to database can improve metabarcoding taxonomic assignments plankton type (Figure 5). In Supplementary Figure S4, we detail substantially (Abad et al., 2016); so when a large local barcode the phyla composition for holoplankton and meroplankton iden- database is used, the improvement in taxonomic identification tified using all methods. can be immense (e.g. up to 7-fold increase in Ransome et al., 2017). In our study, by using the StreamCode DNA barcode data- Discussion base (BLASTn-StreamCode method), we identified more unique Biodiversity assessments performed with metabarcoding methods taxa to phyla that are likely less represented in reference databases indicate astonishing levels of diversity in biological communities. (e.g. Annelida, Bryozoa, Chaetognatha; the first two also identi- Despite the promising potential of metabarcoding methods for fied by Ransome et al., 2017 as under-represented taxa). Working standardized biodiversity surveys (Baird and Hajibabaei, 2012; toward a comprehensive genetic inventory of marine life will not Deiner et al., 2017), our study highlights that even identification only reduce the number of unidentified sequences in metabar- to phylum is not straightforward for understudied ecosystems code analyses, but it will also help focus taxonomic efforts by re- with poor representation in DNA reference databases. Identifying vealing undescribed species in poorly studied groups. metabarcoding samples at taxonomic levels below phylum is criti- To avoid the problem of inadequately identified records (e.g. cal for ecological studies (e.g. biogeography, connectivity, func- “environmental sample” or “uncultured metazoan”), researchers tional groups, biological indicators, parasitism, invasive species), that apply metabarcoding methods can opt for curated databases yet, our case study demonstrates how many metabarcoding sam- that include only sequences that have passed a quality check and ples cannot even be identified to phylum or plankton type. have complete taxonomic information. In some cases, curated Because bias in taxonomic identification with metabarcoding databases also require sequences to be associated with voucher propagates to downstream analyses, it is important to minimize specimens (e.g. BOLD, http://barcodinglife.org/), which allows Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 11 Figure 5. Proportion of taxa identified as members of the holoplankton, meroplankton, or others (mixed life histories within a group, or a benthic group) for the metabarcoding results with two genetic markers (COI, 18S V1–2) and three methods for taxonomic assignment, combined with the results from the morphological analysis. (a) Metabarcoding-OTUs, (b) metabarcoding-ZOTUs, and (c) morphological analysis. N/A indicates samples that lacked enough taxonomic resolution to be classified into holoplankton or meroplankton. updates to identifications as taxonomic information continues to unidentified sequences for 18S V1–2 than for COI, regardless of grow. Even though curated databases are not comprehensive, the method used for a taxonomic assignment. Other studies using they are continuously improving and there are many efforts un- COI have also found a large number of unidentified samples derway to develop curated databases for different taxonomic (Leray and Knowlton, 2015; Ransome et al., 2017). To better re- groups and genetic markers (EukRef, http://eukref.org/; flect marine biodiversity, some authors recommend a first step MetaZooGene, https://metazoogene.org/; PR2, https://pr2-data that involves a conserved marker useful for coarse taxonomic base.org/). Despite the higher reliability of data from curated assignments (e.g. 18S V1–2 or V9), and a second step that targets databases, for a coarse identification of marine zooplankton taxa a highly variable marker (e.g. COI) to provide finer taxonomic to phylum, we showed that the BLASTn-GenBank approach out- resolution (Leray and Knowlton, 2016). In our study, there were performed PR2 and Midori databases in assigning a phylum to also differences between markers in the relative taxonomic com- most OTUs. This probably happens because PR2 and Midori: position detected (as also observed by Cowart et al., 2015; (i) lag behind GenBank in incorporating new sequences; (ii) in- Djurhuus et al., 2018), and in the proportion of taxa that we were clude only sequences identified to genus or species level. Finally, able to classify into plankton type. Both of these sources of varia- the BLASTn-Streamcode method using our local database out- tion could have large impacts on the interpretation of community performed GenBank in some groups, likely because of our more comparisons in ecological studies. Regardless of whether different complete coverage of relevant taxa (e.g. Annelida: Polychaeta). genetic markers can vary in their specificity to detect some taxa, Additionally, we have observed that the StreamCode DNA bar- or the ability to identify samples with different taxonomic resolu- code database improves resolution at lower taxonomic levels as tion, multiple markers may still reflect similar trends in spatial di- well (Pappalardo et al., 2020; ESA Abstract). versity (Pitz et al., 2020). Importantly, given that we detected some phyla with only one An important consideration when using a multi-marker ap- genetic marker, we support the idea that a multi-marker ap- proach is to apply a common classification scheme across markers proach can be more effective in studies with a broader scope and DNA reference databases. In our marine zooplankton-fo- (Cowart et al., 2015; Bucklin et al., 2016; Leray and Knowlton, cused analyses, we standardized the classification scheme used by 2016; Djurhuus et al., 2018). Both when we tested the Midori (based on NCBI) and the PR2 database (custom classifica- StreamCode barcodes (Table 2) and when we analysed the meta- tion defined by experts) to that in WoRMS. Even though some barcoding data (Table 3), we found a smaller proportion of aspects of the current WoRMS classification are still under debate Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 12 P. Pappalardo et al. (e.g. placement of Sipuncula as a phylum), their large editorial mentioned specific thresholds for different taxonomic levels as a group reviews newer publications and updates their classification “rough proxy” (order: 85%, family: 90%, genus: 95%, and spe- when there is enough support. WoRMS also provides a useful cies: 98%), but did not specify how the thresholds were deter- tool for matching taxon names (https://www.marinespecies.org/ mined. In metabarcoding samples containing taxa from multiple aphia.php?p¼match) and overall, is the most current and author- phyla, uniform thresholds across different phyla may not be ad- itative resource for the classification of marine taxa. It would be visable, due to different evolutionary rates in different taxonomic ideal if genetic data resources (e.g. NCBI, BoLD) could adopt a groups. One example of this variation in COI can be seen in the global taxonomic system (such as the Catalogue of Life, https:// range of confidence thresholds needed by RDP Classifier to ob- www.catalogueoflife.org/), provided that the classification hierar- tain a correct assignment for each taxonomic level in different chy is maintained for all groups with the best data sources. For phyla of marine organisms (results from Midori “leave-one-out- example, even though the Catalogue of Life follows WoRMS for test”, available in http://reference-midori.info/download.php#). marine organisms, their online website was not updated with the More research is needed on how to define confidence thresholds current WoRMS edition at the time of submitting this article for assignment to lower taxonomic levels when using metabar- (December, 2020). In the meantime, it falls on researchers to ad- coding methods, and how those decisions affect results. dress taxonomic standards in multi-marker studies. Individual Increasingly, the computational methods being developed for researchers and research groups could still adopt their own taxon- analysing metabarcode data focus on “micro-scale” variation, omies, even rankless ones (given the arbitrary nature of Linnean such as strain-level variation (e.g. UNOISE, DADA2). These algo- ranks) such as PhyloCode (i.e. PhyloCode, http://phylonames. rithms were developed with the primary goal of examining bacte- org/code/). rial diversity and community structure, as bacteria are not Our data also indicate that exploring different clustering meth- traditionally categorized as species. In fact, many microbiologists ods can improve taxon detection. When a phylum was detected explore strain-level variation in bacteria, as it can be used to elu- only in some of the combinations of clustering, genetic marker, cidate differences in virulence, growth patterns, etc. In contrast, and method for taxonomic assignment (e.g. Acanthocephala, researchers examining metazoan communities to assess commu- Porifera, Hemichordata, see Supplementary Table S6), it was usu- nity diversity, identify potential invasive species, and examine tro- ally represented by a few (sometimes only one) OTUs or ZOTUs. phic-level interactions, are generally more interested in We found two possible explanations for these differences in phy- examining inter-specific diversity. Thus, extending these “micro- lum detection. Differences could appear because of the different scale” analytical tools to examination of metazoan sequences for filtering algorithms used to generate OTUs and ZOTUs. For ex- inter-specific diversity, particularly without somewhat arbitrarily ample, Schenk et al. (2020) found differences between OTUs and assigning a species-level sequence threshold for all taxa, can re- ASVs (¼ZOTUs) in metabarcoding of freshwater nematode com- quire substantial additional work. On the simpler end of the spec- munities and suggested as an explanation the more stringent fil- trum, this includes building phylogenetic trees to confirm tering in the pipeline prior to generating ASVs. In addition, many “species-level” clustering, and on the other end involves incorpo- of the taxa found using only OTUs by Schenk et al. (2020) were rating sophisticated species delimitation methods (e.g. multi-rate detected with a small percentage of sequence reads. For our data, Poisson tree processes, Kapli et al., 2017). Development of analyt- another explanation relates to the pre-defined confidence thresh- ical tools that can be used to explore metazoan diversity, particu- olds to assign a phylum. We noticed cases in which for the raw larly taking into account sequence variation across taxonomic data a phylum was assigned for both OTUs and ZOTUs, but with groups, is needed. different confidence thresholds, and only one of the clustering schemes passed the pre-defined taxonomic thresholds. Singletons Conclusion are commonly filtered out in metabarcoding studies given the If identification of different taxa is critical for a metabarcoding likelihood that they represent artefacts, but some authors argue study, we recommend: (i) using multiple genetic markers, that they could be important for the detection of rare species (ii) implementing multiple methods for a taxonomic assignment, (Brown et al., 2015). Future research using mock communities (iii) clustering the data into different types of molecular opera- could contribute to our understanding of these topics; in the tional units, and more importantly, (iv) collaborating with taxo- meantime, we suggest implementing different clustering nomists to develop a regional database of the groups of interest, approaches if taxon identification is important. especially if they are underrepresented in reference databases. Recommended confidence thresholds for accurate taxonomic This multi-analysis approach can enhance taxon detection and assignment depend on the completeness of the reference database increase confidence in the results. There are already many such for the focal taxon (Porter and Hajibabaei, 2018). We followed collaborations underway that are collectively populating reference recommended guidelines to assign taxa to phylum (e.g. Ransome databases and will improve the performance of taxonomic assign- et al., 2017; Leray et al., 2018). However, there are no agreed- ment in biodiversity surveys using metabarcoding. In addition, upon criteria at lower taxonomic levels. For example, for meta- new tools and pipelines are continuously being developed, many barcoding of 18S using the CREST LCAClassifier algorithm, of them in open-access platforms. To improve taxonomic assign- Lanzén et al. (2012) used minimum similarities between related ments at lower taxonomic levels (species, genus, family), we think taxa and cross-validation from reference datasets to propose spe- that future research should aim to develop taxon-specific thresh- cific thresholds for each taxonomic level (phylum: 80%, class: olds for different genetic markers, to account for the different 85%, order: 90%, family: 95%, genus: 97%, and species: 99%); evolutionary rates in different taxonomic groups. whereas Leasi et al. (2018) used BLAST when analysing the 18S V9 region and chose different thresholds based on the literature Supplementary data and analysis of mock communities (phylum: 90%, family: 93%, Supplementary material is available at the ICESJMS online ver- species: >97%). For metabarcoding of COI, Elbrecht et al. (2017) sion of the manuscript. Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 Taxonomic expertise improves metabarcoding 13 Funding analyses, and expanded StreaCode data including the sampling This work was funded through a Smithsonian National Museum locations is available in the Dryad Digital Repository at DOI of Natural History (NMNH) Associate Director for Science Core https://doi.org/10.5061/dryad.tdz08kpzx. Grant (2017), with additional support from a Smithsonian Global Genome Initiative grant (GGI-Rolling-2017-109a). We References also wish to acknowledge funding and technical support from the Abad, D., Albaina, A., Aguirre, M., Laza-Martı́nez, A., Uriarte, I., Iriarte, A., Villate, F., et al. 2016. Is metabarcoding suitable for es- Smithsonian Institution Barcode Network (SIBN, FY2017 Award tuarine plankton monitoring? A comparative study with micros- cycle). Katrina M. Pagenkopp Lohan was funded as a Robert and copy. Marine Biology, 163: 149. Arlene Kogod Secretarial Scholar. Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. 1990. Basic local alignment search tool. Journal of Molecular Acknowledgements Biology, 215: 403–410. Genetic benchwork and sequencing was completed at the Aylagas, E., Borja, Á., Irigoien, X., and Rodrı́guez-Ezpeleta, N. 2016. Smithsonian NMNH Laboratories of Analytical Biology (LAB). Benchmarking DNA metabarcoding for biodiversity-based moni- We thank NMNH specialists Rebecca Dikow, Mike Trizna, and toring and assessment. Frontiers in Marine Science, 3: http://jour Matthew Kweskin for their assistance with the computations us- nal.frontiersin.org/Article/10.3389/fmars.2016.00096/abstract (last accessed 11 January 2020). ing the Smithsonian High Performance Cluster (SI/HPC, Baird, D. J., and Hajibabaei, M. 2012. Biomonitoring 2.0: a new para- Smithsonian Institution, https://doi.org/10.25572/SIHPC), and digm in ecosystem assessment made possible by next-generation Amanda Devine for assistance using the GGI Gap Analysis Tool. DNA sequencing. Molecular Ecology, 21: 2039–2044. We also thank Museum Specialist Abigail Reft for her assistance Bazinet, A. L., and Cummings, M. P. 2012. A comparative evaluation with the cephalopods data, and Julia Steier for her helpful edito- of sequence classification programs. BMC Bioinformatics, 13: 92. rial review on the final version. We thank the crew of the M/V Bhadury, P., Austen, M., Bilton, D., Lambshead, P., Rogers, A., and Thunderforce for technical support during zooplankton collec- Smerdon, G. 2006. Development and evaluation of a tions. Additionally, we are grateful to Smithsonian Marine DNA-barcoding approach for the rapid identification of nemato- Station laboratory and field support staff, Sherry Reed, Woody des. Marine Ecology Progress Series, 320: 1–9. Lee, David Branson, and Scott Jones for their assistance, to Brown, E. A., Chain, F. J. J., Crease, T. J., MacIsaac, H. J., and NMNH administrative support Carol Youmans, Marisol Cristescu, M. E. 2015. Divergence thresholds and divergent biodi- versity estimates: can metabarcoding reliably describe zooplank- Arciniega-Melendez, and Joan Kaminski, and to Lisa Comer and ton communities? Ecology and Evolution, 5: 2234–2251. Mark Lehtonen for collections management support. This publi- Bucklin, A., Lindeque, P. K., Rodriguez-Ezpeleta, N., Albaina, A., and cation is Smithsonian Marine Station contribution no. 1156. Lehtiniemi, M. 2016. Metabarcoding of marine zooplankton: prospects, progress and pitfalls. Journal of Plankton Research, 38: Author’s contributions 393–400. MJB and KJO conceived the StreamCode project idea, secured Callahan, B. J., McMurdie, P. J., and Holmes, S. P. 2017. Exact se- the initial funding and led data collection in the field. KJO devel- quence variants should replace operational taxonomic units in oped the workflows and record keeping for the fieldwork and se- marker-gene data analysis. The ISME Journal, 11: 2639–2643. cured additional funding. All authors excluding AB, KM, NER, Caporaso, J. G., Kuczynski, J., and Knight, R. 2010. QIIME allows PP, and KMLP, participated in the fieldwork and morphological analysis of high-throughput community sequencing data. Nature Methods, 7: 335–336. classification. AGC, KMH, KJO, MJB, MV, PP, and WJ, per- Carew, M. E., Kellar, C. R., Pettigrove, V. J., and Hoffmann, A. A. formed quality controls on refined IDs. SBT generated the 18S 2018. Can high-throughput sequencing detect macroinvertebrate V1–2 barcodes, KM and NER generated the COI barcodes. MJB diversity for routine monitoring of an urban river? Ecological processed all samples for metabarcoding. KMLP developed and Indicators, 85: 440–450. implemented the bioinformatics pipeline, KMLP and PP worked Clarke, L. J., Beard, J. M., Swadling, K. M., and Deagle, B. E. 2017. on taxonomic assignments for sequence data. AB contributed Effect of marker choice and thermal cycling protocol on zoo- with the gap analysis and metadata of StreamCode barcodes. PP plankton DNA metabarcoding studies. Ecology and Evolution, 7: designed and conducted data analyses and wrote the first draft. 873–883. KJO, KMH, KMLP, and AGC, contributed critically to the data Cowart, D. A., Pinheiro, M., Mouchel, O., Maguer, M., Grall, J., Miné, J., and Arnaud-Haond, S. 2015. Metabarcoding is powerful analysis and interpretation of results. AGC, EES, JAG, KJO, yet still blind: a comparative analysis of morphological and mo- KMH, KMLP, MJH, MJB, MV, NER, SLB, and WJ provided feed- lecular surveys of seagrass communities. PLoS One, 10: e0117562. back on the manuscript. All authors approved the final Cristescu, M. E. 2014. From barcoding single individuals to metabar- manuscript. coding biological communities: towards an integrative approach to the study of global biodiversity. Trends in Ecology & Data availability Evolution, 29: 566–571. The StreamCode DNA barcode sequences underlying this article Deiner, K., Bik, H. M., Mächler, E., Seymour, M., Lacoursière- are available in the GenBank Nucleotide Database at https://www. Roussel, A., Altermatt, F., Creer, S., et al. 2017. Environmental ncbi.nlm.nih.gov/genbank/, and can be accessed with the acces- DNA metabarcoding: transforming how we survey animal and plant communities. Molecular Ecology, 26: 5872–5895. sion numbers provided in Supplementary Table S1; sequences DeWalt, R. E. 2011. DNA barcoding: a taxonomic point of view. are also associated to NCBI BioProject PRJNA421480. The Journal of the North American Benthological Society, 30: StreamCode metabarcoding raw data is available through the 174–181. NCBI Sequence Read Archive and included in the NCBI Djurhuus, A., Pitz, K., Sawaya, N. A., Rojas-Márquez, J., Michaud, B., BioProject PRJNA421480. The raw data includes the V9 region of Montes, E., Muller-Karger, F., et al. 2018. Evaluation of marine 18S that was not part of this analysis. The R code used for the zooplankton community structure through environmental DNA Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021 14 P. Pappalardo et al. metabarcoding: metabarcoding zooplankton from eDNA. Pagenkopp Lohan, K. M., Fleischer, R. C., Carney, K. J., Holzer, K. Limnology and Oceanography: Methods, 16: 209–221. K., and Ruiz, G. M. 2016. Amplicon-based pyrosequencing reveals Edgar, R. C. 2013. UPARSE: highly accurate OTU sequences from high diversity of protistan parasites in ships’ ballast water: impli- microbial amplicon reads. Nature Methods, 10: 996–998. cations for biogeography and infectious diseases. Microbial Edgar, R. C. 2016. UNOISE2: improved error-correction for Illumina Ecology, 71: 530–542. 16S and ITS amplicon sequencing. bioRxiv, 081257. https://www. Pappalardo, P., Pagenkopp Lohan, K. M., Boyle, M. J., Collins, A. G., semanticscholar.org/paper/UNOISE2%3A-improved-error-correc Hanson, K. M., Truskey, S. B., Jaekle, W., et al. 2020. Improving tax- tion-for-Illumina-16S-Edgar/a9867e89687a6fb581e664c8880e743 onomic assignment of DNA metabarcoding with taxonomic exper- 8af8c8b5a tise. Ecological Society of America Annual Meeting, Aug 3-6. https:// Edgar, R. C. 2018. Updating the 97% identity threshold for 16S ribo- eco.confex.com/eco/2020/meetingapp.cgi/Paper/87812. somal RNA OTUs. Bioinformatics, 34: 2371–2375. Parry, L. A., Edgecombe, G. D., Eibye-Jacobsen, D., and Vinther, J. Elbrecht, V., Vamos, E. E., Meissner, K., Aroviita, J., and Leese, F. 2016. The impact of fossil data on annelid phylogeny inferred 2017. Assessing strengths and weaknesses of DNA metabarco- from discrete morphological characters. Proceedings of the Royal ding-based macroinvertebrate identification for routine stream Society B. Biological Sciences, 283: 20161378. monitoring. Methods in Ecology and Evolution, 8: 1265–1275. Pitz, K. J., Guo, J., Johnson, S. B., Campbell, T. L., Zhang, H., Guillou, L., Bachar, D., Audic, S., Bass, D., Berney, C., Bittner, L., Vrijenhoek, R. C., Chavez, F. P., et al. 2020. Zooplankton biogeo- Boutte, C., et al. 2012. The Protist Ribosomal Reference database graphic boundaries in the California Current System as deter- (PR2): a catalog of unicellular eukaryote Small Sub-Unit rRNA mined from metabarcoding. PLoS One, 15: e0235159. sequences with curated taxonomy. Nucleic Acids Research, 41: Porter, T. M., and Hajibabaei, M. 2018. Scaling up: a guide to high- D597–D604. throughput genomic approaches for biodiversity analysis. Horton, T., Kroh, A., Ahyong, S., Bailly, N., Boyko, C. B., Brand~ao, S. N., Molecular Ecology, 27: 313–338. Gofas, S., et al. 2020. World Register of Marine Species (WoRMS). Puillandre, N., Bouchet, P., Boisselier-Dubayle, M.-C., Brisset, J., WoRMS Editorial Board. https://www.marinespecies.org. Buge, B., Castelin, M., Chagnoux, S., et al. 2012. New taxonomy Kapli, P., Lutteropp, S., Zhang, J., Kobert, K., Pavlidis, P., Stamatakis, and old collections: integrating DNA barcoding into the collection A., and Flouri, T. 2017. Multi-rate Poisson tree processes for single-- curation process. Molecular Ecology Resources, 12: 396–402. locus species delimitation under maximum likelihood and Markov Quast, C., Pruesse, E., Yilmaz, P., Gerken, J., Schweer, T., Yarza, P., chain Monte Carlo. Journal of Plankton Research, 41: 571–582. Peplies, J., et al. 2012. The SILVA ribosomal RNA gene database Lanzén, A., Jørgensen, S. L., Huson, D. H., Gorfer, M., Grindhaug, S. project: improved data processing and web-based tools. Nucleic H., Jonassen, I., Øvreås, L., et al. 2012. CREST – classification Acids Research, 41: D590–D596. resources for environmental sequence tags. PLoS One, 7: e49334. R Core Team. 2020. R: A Language and Environment for Statistical Leasi, F., Sevigny, J. L., Laflamme, E. M., Artois, T., Curini-Galletti, Computing. R Foundation for Statistical Computing, Vienna, M., de Jesus Navarrete, A., Di Domenico, M., et al. 2018. Austria. https://www.R-project.org/. Biodiversity estimates and ecological interpretations of meiofau- Ransome, E., Geller, J. B., Timmers, M., Leray, M., Mahardini, A., nal communities are biased by the taxonomic approach. Sembiring, A., Collins, A. G., et al. 2017. The importance of stan- Communications Biology, 1: 112. dardization for biodiversity comparisons: a case study using au- Lejzerowicz, F., Esling, P., Pillet, L., Wilding, T. A., Black, K. D., and tonomous reef monitoring structures (ARMS) and Pawlowski, J. 2015. High-throughput sequencing and morphol- metabarcoding to measure cryptic diversity on Mo’orea coral ogy perform equally well for benthic monitoring of marine reefs, French Polynesia. PLoS One, 12: e0175066. ecosystems. Scientific Reports, 5: 13932. Ratnasingham, S., and Hebert, P. D. N. 2007. BOLD: the barcode of Leray, M., and Knowlton, N. 2015. DNA barcoding and metabarcod- life data system (http://www.barcodinglife.org). Molecular ing of standardized samples reveal patterns of marine benthic di- Ecology Notes, 7: 355–364. versity. Proceedings of the National Academy of Sciences, 112: 2076–2081. Richardson, R. T., Bengtsson-Palme, J., and Johnson, R. M. 2017. Evaluating and optimizing the performance of software com- Leray, M., and Knowlton, N. 2016. Censusing marine eukaryotic di- versity in the twenty-first century. Philosophical Transactions of monly used for the taxonomic classification of DNA metabarcod- the Royal Society B: Biological Sciences, 371: 20150331. ing sequence data. Molecular Ecology Resources, 17: 760–769. Leray, M., Ho, S.-L., Lin, I.-J., and Machida, R. J. 2018. MIDORI Santoferrara, L. F. 2019. Current practice in plankton metabarcoding: server: a webserver for taxonomic assignment of unknown meta- optimization and error management. Journal of Plankton zoan mitochondrial-encoded sequences using a curated database. Research, 41: 571–582. Bioinformatics, 34: 3753–3754. Schenk, J., Kleinbölting, N., and Traunspurger, W. 2020. Comparison Lindeque, P. K., Parry, H. E., Harmer, R. A., Somerfield, P. J., and of morphological, DNA barcoding, and metabarcoding character- Atkinson, A. 2013. Next generation sequencing reveals the hidden izations of freshwater nematode communities. Ecology and diversity of zooplankton assemblages. PLoS One, 8: e81327. Evolution, 10: 2885–2899. Lobo, J., Shokralla, S., Costa, M. H., Hajibabaei, M., and Costa, F. O. Struck, T. H., Paul, C., Hill, N., Hartmann, S., Hösel, C., Kube, M., 2017. DNA metabarcoding for high-throughput monitoring of Lieb, B., et al. 2011. Phylogenomic analyses unravel annelid evolu- estuarine macrobenthic communities. Scientific Reports, 7: 15618. tion. Nature, 471: 95–98. Machida, R. J., Leray, M., Ho, S.-L., and Knowlton, N. 2017. Tang, C. Q., Leasi, F., Obertegger, U., Kieneke, A., Barraclough, T. Metazoan mitochondrial gene sequence reference datasets for tax- G., and Fontaneto, D. 2012. The widely used small subunit 18S onomic assignment of environmental samples. Scientific Data, 4: rDNA molecule greatly underestimates true diversity in biodiver- 170027. sity surveys of the meiofauna. Proceedings of the National Pagenkopp Lohan, K., Campbell, T. L., Guo, J., Wheelock, M., Academy of Sciences, 109: 16208–16212. DiMaria, R. A., and Geller, J. B. 2019. Intact vs. homogenized Wang, Q., Garrity, G. M., Tiedje, J. M., and Cole, J. R. 2007. Naive subsampling: testing impacts of pre-extraction processing of Bayesian classifier for rapid assignment of rRNA sequences into multi-species samples on invasive species detection. Management the new bacterial taxonomy. Applied and Environmental of Biological Invasions, 10: 324–341. Microbiology, 73: 5261–5267. Handling editor: William Grant Downloaded from https://academic.oup.com/icesjms/advance-article/doi/10.1093/icesjms/fsab082/6277123 by smithsonia5 user on 18 May 2021