DR. MICHAEL G. CAMPANA (Orcid ID: 0000-0003-0461-6462)

Article type: Resource Article

BaitsTools: software for hybridization capture bait design

Michael G. Campana

Center for Conservation Genomics, Smithsonian Conservation Biology Institute, 3001 Connecticut Avenue NW, Washington, DC 20008, USA

This is the author manuscript accepted for publication and has undergone full peer review but has not been through the copyediting, typesetting, pagination and proofreading process, which may lead to differences between this version and the Version of Record. Please cite this article as doi: 10.1111/1755-0998.12721

This article is protected by copyright. All rights reserved. Author Manuscript

Keywords: hybridization capture, bait, targeted sequencing, software, nucleic acid

Corresponding author: Michael G. Campana
Center for Conservation Genomics, Smithsonian Conservation Biology Institute, 3001 Connecticut Avenue NW, Washington, DC 20008, USA.
Fax: +1 202-633-1237; E-mail: campanam@si.edu

Running title: BaitsTools

Abstract

Nucleic acid hybridization capture is a principal technology in molecular ecology and genomics. Bait design, however, is a non-trivial task and few resources currently exist to automate the process. Here, I present BaitsTools, an open-source, user-friendly software package to facilitate the design of nucleic acid baits for hybridization capture.

Introduction

Targeted high-throughput sequencing using hybridization capture (e.g. Gnirke et al. 2009) is a critical tool in molecular ecology and genomics. Applications include genomic investigations of non-model organisms using ultra-conserved elements (Faircloth et al. 2012; Lim & Braun 2016), exome capture (e.g. Ng et al. 2009), single nucleotide polymorphism (SNP) analysis (e.g. Burbano et al. 2010), targeted metagenomics (e.g. Campana et al.
2016), and ancient DNA enrichment and museomics (e.g. Burbano et al. 2010; Hawkins et al. 2016; Lim & Braun 2016), among others. Hybridization capture utilizes oligonucleotide baits to enrich target molecules from nucleic acid libraries through hybridization of the baits to complementary nucleotide sequences in the libraries, isolation of the hybridized molecules, and removal of non-target library molecules. Manual bait design is non-trivial, and few software packages are publicly available for this task (see, for instance, Faircloth 2017). Here, I describe BaitsTools, an open-source package to design and in silico test bait sequences for a variety of hybridization capture applications.

BaitsTools functions

BaitsTools generates high-quality oligonucleotide baits from a variety of input formats using input-specific subcommands (Table 1). Subcommand parameters are user customizable with defaults suited for generating 120 bp RNA baits (such as MYbaits® from MYcroarray). Currently, BaitsTools can generate baits from FASTA/FASTQ sequences and alignments, Stacks (Catchen et al. 2011, 2013) population summary statistics files, genome annotations and features (BED/GTF/GFF), PyRAD and ipyrad loci files (Eaton 2014), and VCF files. The software can also analyze and filter previously generated bait sequences using the checkbaits subcommand. BaitsTools utilizes a three-step workflow: variant selection, bait generation, and bait quality control and filtration (Figure 1). Depending on the selected subcommand and user requirements, some of these steps can be omitted. BaitsTools can output detailed log files giving locus- and subcommand-specific results for each of these steps.

Variant selection

Genome sequencing and reduced-representation approaches (such as RADseq) often discover orders of magnitude more sequence variants than are typically analyzed in genomic projects using hybridization capture. BaitsTools can select variants from VCF, PyRAD and ipyrad LOCI files, and Stacks population summary statistics files to identify a subset of variants evenly spaced across genomes. Genome assemblies vary significantly in quality, ranging from a selection of assembled reduced-representation loci to contig-, scaffold- or chromosome-level whole-genome assemblies. Hereafter, individual component assembled sequences are referred to as ‘contigs’ for simplicity. To ensure even spacing across reference sequences of varying quality, the user can select a maximum number of variants per contig or can scale the number of selected variants per individual contig by its length. The first option is useful for highly fragmented assemblies or reduced-representation datasets without a genome assembly in order to sample as many genomic markers as possible, whereas the latter is appropriate for high-quality genome assemblies where most polymorphisms are located on the longest contigs. The user can also specify a minimum physical distance between selected variants to ensure equal coverage across the reference sequences and mitigate linkage disequilibrium. Additionally, stacks2baits can sort polymorphisms varying within or between populations and by deviation from Hardy-Weinberg equilibrium according to a χ² test. Finally, vcf2baits can exclude sequence variants below a minimum specified Phred-like quality score.

Selected variants are output in a new VCF or Stacks summary table for the vcf2baits and stacks2baits commands, respectively.
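
The selection strategy described above (a per-contig cap or a length-scaled quota, combined with a minimum spacing between variants) can be sketched in Python as follows. This is an illustration only: BaitsTools itself is written in Ruby, and the function names `thin_variants` and `select_variants`, their parameters, and the greedy left-to-right thinning are assumptions made for this example, not the program's actual algorithm.

```python
from collections import defaultdict


def thin_variants(positions, min_distance):
    """Greedily keep sorted positions that lie at least min_distance apart."""
    kept = []
    for pos in sorted(positions):
        if not kept or pos - kept[-1] >= min_distance:
            kept.append(pos)
    return kept


def select_variants(variants, contig_lengths, total,
                    min_distance=0, max_per_contig=None):
    """Pick a spaced subset of (contig, position) variants.

    If max_per_contig is given, every contig is capped at that number
    (suited to fragmented assemblies). Otherwise each contig receives a
    share of `total` proportional to its length (hypothetical rounding rule).
    """
    by_contig = defaultdict(list)
    for contig, pos in variants:
        by_contig[contig].append(pos)

    assembly_length = sum(contig_lengths.values())
    selected = []
    for contig in sorted(by_contig):
        if max_per_contig is not None:
            quota = max_per_contig
        else:
            quota = round(total * contig_lengths[contig] / assembly_length)
        spaced = thin_variants(by_contig[contig], min_distance)
        selected.extend((contig, pos) for pos in spaced[:quota])
    return selected


if __name__ == "__main__":
    # Toy data: five variants on two contigs of unequal length.
    toy_variants = [("chr1", 100), ("chr1", 150), ("chr1", 400),
                    ("chr2", 50), ("chr2", 900)]
    toy_lengths = {"chr1": 3000, "chr2": 1000}
    print(select_variants(toy_variants, toy_lengths, total=4, min_distance=200))
    print(select_variants(toy_variants, toy_lengths, total=4, max_per_contig=1))
```

The length-scaled branch favours long contigs, whereas the capped branch spreads the selection across many short contigs, which mirrors the trade-off between the two options described in the text.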
The BaitsTools variant-selection and bait-generation options are also appended to the end of the VCF header for future user reference.

Bait generation

To generate candidate bait sequences, BaitsTools imports reference sequences in FASTA/FASTQ or PyRAD/ipyrad LOCI format. For the aln2baits and tilebaits commands, the input nucleotide alignments or sequence list are treated as reference sequences. BaitsTools can also generate baits across the break in linearized circular sequences (e.g. complete mitogenomes in FASTA format). Appending ‘#circ’ to the end of a sequence header indicates to BaitsTools that a sequence is circular. Otherwise, BaitsTools assumes linear sequences.

After reference sequence importation, baits are generated according to each subcommand’s algorithm. For vcf2baits and stacks2baits, the regions surrounding the selected variants are extracted from the reference sequence using genomic coordinates. The extracted region is determined by the specified bait length, tiling density, and position of the selected variants within the candidate bait. Optionally, alternate alleles are then applied to the obtained bait sequences to produce a balanced bait set representing all known alleles equally.

The tilebaits subcommand divides the imported sequences into baits based on requested bait length and tiling density. The annot2baits and bed2baits subcommands extract specified genomic features from the reference sequences. The extracted sequences are output in FASTA format for user reference. The extracted sequences are then passed to tilebaits to generate the candidate baits. Similarly, the aln2baits subcommand divides the alignment into windows based on desired bait length and tiling density. Baits are then generated either for each observed haplotype within a window or for every permutation of variants observed within a window.
This produces a weighted bait set that has higher coverage for more variable regions and reduced bait redundancy for conserved regions. Additionally, pyrad2baits can import individual loci as sequence alignments rather than SNP variant calls. The loci alignments are then passed to aln2baits to generate weighted bait sets.

Candidate baits are then output in FASTA format along with an optional BED file specifying the location of the bait sequences with regard to the input reference sequences.

Quality control and filtration

During the final step, candidate baits are filtered by user-specified quality-control parameters. Filterable parameters include GC content, bait melting temperature, reference sequence base quality, percentage of masked bait sequence, presence of gaps and unknown bases (Ns) in bait sequences, and whether generated baits are shorter than the specified desired bait length. Baits with gap characters can also be extended with flanking sequence to ensure that deletion variants are efficiently captured. BaitsTools then generates a set of filtered baits in FASTA format and an optional BED file describing the location of filtered baits with regard to the input reference sequences. For the vcf2baits and stacks2baits commands, BaitsTools also produces a filtered VCF or Stacks summary file, respectively. For user reference, the filtration parameters are added to the header of the filtered VCF after the BaitsTools variant selection and bait generation parameters. BaitsTools can also generate a summary that tabulates the filtration parameters and inclusion/exclusion from the final filtered bait set for each candidate bait. The quality-controlled baits are suitable either for direct manufacture of hybridization capture kits or further filtration using platform-specific proprietary pipelines.
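
As a rough illustration of the bait generation and filtration steps above, the following Python sketch tiles a sequence into baits at a fixed offset, optionally wrapping across the break of a circular sequence, and then applies length, gap/N, and GC-content filters. It is a minimal sketch under stated assumptions, not BaitsTools code (which is Ruby): the function names, the use of a base-pair offset to express tiling density, and the treatment of trailing short baits are all hypothetical, and melting temperature is omitted because the formula used is not given in the text.

```python
def tile_baits(sequence, bait_length=120, offset=60, circular=False):
    """Return (start, bait) tuples tiled across `sequence` every `offset` bases.

    Circular sequences wrap across the linearization break. For linear
    sequences, trailing baits shorter than bait_length are kept so that a
    later quality-control step can decide whether to discard them.
    """
    n = len(sequence)
    baits = []
    for start in range(0, n, offset):
        end = start + bait_length
        if end <= n:
            bait = sequence[start:end]
        elif circular:
            bait = sequence[start:] + sequence[:end - n]
        else:
            bait = sequence[start:]
        baits.append((start, bait))
    return baits


def gc_content(bait):
    """Percentage of G and C bases in a bait (0.0 for an empty bait)."""
    if not bait:
        return 0.0
    gc = sum(1 for base in bait.upper() if base in "GC")
    return 100.0 * gc / len(bait)


def filter_baits(baits, bait_length=120, min_gc=30.0, max_gc=50.0,
                 require_full_length=True):
    """Keep baits that pass the length, gap/N, and GC-content checks."""
    kept = []
    for start, bait in baits:
        if require_full_length and len(bait) < bait_length:
            continue  # shorter than the requested bait length
        if "N" in bait.upper() or "-" in bait:
            continue  # contains unknown bases or gap characters
        if not min_gc <= gc_content(bait) <= max_gc:
            continue  # GC content outside the accepted window
        kept.append((start, bait))
    return kept


if __name__ == "__main__":
    toy = "ACGTACGTAC"
    linear = tile_baits(toy, bait_length=4, offset=2)
    wrapped = tile_baits(toy, bait_length=4, offset=2, circular=True)
    print(filter_baits(linear, bait_length=4))
    print(filter_baits(wrapped, bait_length=4))
```

With the toy sequence, the linear tiling ends in a two-base bait that the full-length filter removes, whereas the circular tiling completes that bait with bases from the start of the sequence.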

User interface

To accommodate users with different computational needs and comfort with command-line interfaces, BaitsTools utilizes both a standard command-line interface using arguments and an interactive interface using text prompts (Figure 2). An optional graphical frontend is also available for macOS systems. Executing the baitstools.rb script without subcommands or arguments prints a list of available subcommands and their functions to the screen. Detailed help messages are available for each subcommand by executing the baitstools.rb script with a subcommand and the ‘-h’ or ‘--help’ arguments. Executing the baitstools.rb script with a subcommand without further arguments launches the interactive interface. For instance, executing ‘baitstools.rb vcf2baits -h’ prints detailed help on the vcf2baits subcommand, whereas executing ‘baitstools.rb vcf2baits’ activates the interactive prompts for the vcf2baits subcommand. Furthermore, to improve user-friendliness, BaitsTools will interactively prompt the user to correct entries from the command-line interface when the subcommand cannot be executed as entered (e.g. if a needed input file is not found). Upon execution, BaitsTools will print to the screen the complete interpreted command (including user-unspecified defaults) to ensure that users can accurately reproduce their commands in later analyses.

Software requirements and licensing

BaitsTools is a self-contained Ruby (Matsumoto 2013) package and is therefore compatible with most UNIX and UNIX-like operating systems.
Besides Ruby (version 2.0 or greater) and its standard library, BaitsTools has no additional dependencies and does not require local compilation before execution. The optional frontend requires the Ruby gem “tk” (version 1.2 or greater) (Shibata 2017) and the Ruby Version Manager (Seguin & Papis 2016). BaitsTools is compatible with both the Ruby reference implementation (Matz’s Ruby Interpreter) and the Rubinius (version 3.73 or greater) compiler (Phoenix 2006). The program is freely available under the Smithsonian Institution terms of use (http://www.si.edu/termsofuse).

BaitsTools pipelines

Although BaitsTools produces high-quality bait sets on its own, bait set performance can be improved with the addition of external tools into the bait generation pipeline (Figure 3). To reduce the capture of repetitive regions and low-complexity sequences, RepeatMasker (Smit et al. 2013–2015) can mask these features in the reference sequences. BaitsTools can then exclude baits that include repetitive sequences using the ‘-K’ or ‘--maxmask’ arguments. Downstream of BaitsTools, baits can be clustered using Cd-hit (Li & Godzik 2006) to efficiently remove overly redundant sequences. BLAST (Altschul et al. 1990) searches of bait sequences against reference genomes and the other candidate baits can help identify problematic oligonucleotides for removal. Common issues include non-specific baits that can hybridize with multiple genomic targets, self-complementarity, and inter-bait hybridization.
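
The masking-based exclusion described above can be illustrated with a short Python sketch that computes the percentage of masked bases in each bait and drops baits above a threshold. It assumes soft-masked reference sequences in which masked bases are lowercase (RepeatMasker can produce such output with its -xsmall option). The function names and the threshold value are hypothetical, and the sketch is not a description of how the ‘--maxmask’ filter is implemented in BaitsTools.

```python
def masked_percentage(bait):
    """Percentage of soft-masked (lowercase) bases in a bait."""
    if not bait:
        return 0.0
    masked = sum(1 for base in bait if base.islower())
    return 100.0 * masked / len(bait)


def exclude_masked(baits, max_mask=25.0):
    """Keep baits whose masked percentage does not exceed max_mask.

    max_mask is an illustrative threshold chosen for this example.
    """
    return [bait for bait in baits if masked_percentage(bait) <= max_mask]


if __name__ == "__main__":
    candidates = ["ACGTACGT", "ACGTacgt", "acgtacgt"]
    for bait in candidates:
        print(bait, masked_percentage(bait))
    print(exclude_masked(candidates, max_mask=50.0))
```

Filtering on a percentage rather than on the mere presence of masked bases lets baits that only clip the edge of a repeat survive, which may matter when targets sit close to repetitive regions.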

Comparison to existing software

BaitsTools is more flexible and covers a wider variety of hybridization capture applications than existing publicly available software, such as BaitDesigner (Broad Institute 2017) and the PHYLUCE ultra-conserved element (UCE) workflow (Faircloth 2017). BaitDesigner is an unpublished oligonucleotide bait design tool included within the Picard package (Broad Institute 2017). BaitDesigner implements a few features not currently included in BaitsTools (such as Agilent file output). However, BaitDesigner only accepts FASTA sequences as input and has limited bait filtration and quality control options. It also requires the generation of Picard interval lists prior to usage. This interval list can be used to extract regions of interest from the reference sequence. The BaitsTools bed2baits and annot2baits subcommands perform similar region extraction without the need for a custom file format.

The PHYLUCE UCE workflow is designed to identify and produce baits for UCE loci from aligned genomes and sequence data (Faircloth 2017). Although BaitsTools does not identify UCEs, it can be used to design appropriate bait sequences once these loci are identified. BaitsTools, however, does not provide the post-capture and sequencing UCE data analysis pipeline included in PHYLUCE (Faircloth et al. 2012). Neither BaitDesigner nor the PHYLUCE UCE workflow can design baits from VCFs, LOCI files, or Stacks population summary statistics files.

Performance

To benchmark typical BaitsTools performance, baits were generated and filtered using sequence data from previously sequenced African wild dog (Lycaon pictus) genomes (Campana et al. 2016), reference sequences from GenBank (accessions: KT448283.1, NC_008093.1, NC_002008.4, NC_006621.3) (Bjornerfeldt et al. 2006; Kim et al. 1998; Koepfli et al. 2015; Lindblad-Toh et al.
2005), and simulated Stacks data and ipyrad loci (available: ipyrad.readthedocs.io/output_formats.html). Benchmarked datasets are included in the example data within the BaitsTools repository, except for the Canis familiaris X chromosome sequence (GenBank accession: NC_006621.3) due to file size limitations. All benchmark analyses used BaitsTools version 0.9 and were performed single-threaded on a desktop computer running macOS El Capitan (10.11.6) powered by a 3.5 GHz hexacore Intel Xeon E5 processor with 64 GB 1866 MHz DDR3 ECC memory. Benchmark analyses and results are summarized in Table 2. Benchmark analyses were run under default settings unless otherwise noted. RNA baits were generated to capture DNA sequences. Sequence ambiguities were collapsed. The full-length bait was required. Baits including gaps or unknown bases were excluded. Retained baits had GC contents between 30% and 50% and melting temperatures between 0.0°C and 120.0°C. Parameter files, absolute BED coordinates (except in the checkbaits experiment), and detailed logs were output for all experiments.

Furthermore, to compare performance between BaitsTools tilebaits and BaitDesigner (Picard version 2.9.4), baits were generated from a 16,725 bp Lycaon pictus mitogenome (GenBank accession: CM007595.1; Campana et al. 2016) under analogous settings. Each program generated 120 bp baits with a 60 bp offset between baits. The full-length bait was required. Since BaitDesigner does not filter baits and cannot tile over circular sequences, no other filters were applied in BaitsTools and the mitogenome was treated as a linear sequence. Bait coordinates were output either as an interval list (BaitDesigner) or as a BED file (BaitsTools). BaitDesigner completed the task in 0.512 wall-clock seconds (0.814 user seconds, 0.092 system seconds), whereas tilebaits finished in 0.120 wall-clock seconds (0.100 user seconds, 0.014 system seconds). The resulting baits were identical.

BaitsTools is fast. Most benchmarking experiments completed in less than a second. Furthermore, tilebaits produced the same bait set as BaitDesigner in 23% of the wall-clock time and 12% of the user time.

Conclusion

BaitsTools is a user-friendly, fast, open-source software package that simplifies the production of baits for hybridization capture. Since the software is highly user configurable and reads a variety of input formats, BaitsTools can produce baits for a wide range of targeted genomics applications.

Acknowledgements

The author thanks the members of the Center for Conservation Genomics, Smithsonian Conservation Biology Institute for their support. The National Science Foundation (award DEB-1547168), the Morris Animal Foundation (award D14ZO-308), and the National Geographic Society (award 884-610) supported this research.

References

Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. Journal of Molecular Biology, 215, 403–410.
Bjornerfeldt S, Webster MT, Vilà C (2006) Relaxation of selective constraint on dog mitochondrial DNA following domestication. Genome Research, 16, 990–994.
Broad Institute (2017) Picard 2.11.0. Available from: broadinstitute.github.io/picard/.
Burbano HA, Hodges E, Green RE et al. (2010) Targeted investigation of the Neandertal genome by array-based sequence capture. Science, 328, 723–725.
Campana MG, Hawkins MTR, Henson LH et al.
(2016) Simultaneous identification of host, ectoparasite, and pathogen DNA via in-solution capture. Molecular Ecology Resources, 16, 1224–1239.
Campana MG, Parker LD, Hawkins MTR et al. (2016) Genome sequence, population history, and pelage genetics of the endangered African wild dog (Lycaon pictus). BMC Genomics, 17, 1013.
Catchen J, Hohenlohe PA, Bassham S et al. (2013) Stacks: an analysis tool set for population genomics. Molecular Ecology, 22, 3124–3140.
Catchen JM, Amores A, Hohenlohe P et al. (2011) Stacks: building and genotyping loci de novo from short-read sequences. G3: Genes, Genomes, Genetics, 1, 171–182.
Eaton DAR (2014) PyRAD: assembly of de novo RADseq loci for phylogenetic analyses. Bioinformatics, 30, 1844–1849.
Faircloth BC (2017) Identifying conserved genomic elements and designing universal bait sets to enrich them. Methods in Ecology and Evolution. doi: http://dx.doi.org/10.1111/2041-210X.12754.
Faircloth BC, McCormack JE, Crawford NG et al. (2012) Ultraconserved elements anchor thousands of genetic markers for target enrichment spanning multiple evolutionary timescales. Systematic Biology, 61, 717–726.
Gnirke A, Melnikov A, Maguire J et al. (2009) Solution hybrid selection with ultra-long oligonucleotides for massively parallel targeted sequencing. Nature Biotechnology, 27, 182–189.
Hawkins MTR, Hofman CA, Callicrate T et al. (2016) In-solution hybridization for mammalian mitogenome enrichment: pros, cons and challenges associated with multiplexing degraded DNA. Molecular Ecology Resources, 16, 1173–1188.
Kim KS, Lee SE, Jeong HW, Ha JH (1998) The complete nucleotide sequence of the domestic dog (Canis familiaris) mitochondrial genome. Molecular Phylogenetics and Evolution, 10, –220.
Koepfli K-P, Pollinger J, Godinho R et al.
(2015) Genome-wide evidence reveals that African and Eurasian golden jackals are distinct species. Current Biology, 16, 2158–2165.
Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 10, 1658–1659.
Lim HC, Braun MJ (2016) High-throughput SNP genotyping of historical and modern samples of five bird species via sequence capture of ultraconserved elements. Molecular Ecology Resources, 16, 1204–1223.
Lindblad-Toh K, Wade CM, Mikkelsen TS et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803–819.
Ng SB, Turner EH, Robertson PD et al. (2009) Targeted capture and massively parallel sequencing of 12 human exomes. Nature, 461, 272–276.
Matsumoto Y (2013) Ruby programming language, version 2.0.0. Available from: http://www.ruby-lang.org.
Phoenix E (2006) Rubinius, version 3.73. Available from: https://rubinius.com.
Shibata H (2017) Tk 0.1.0. Available from: https://rubygems.org/gems/tk/versions/0.1.0.
Seguin WE, Papis M (2016) RVM: Ruby Version Manager 1.2.7. Available from: https://rvm.io.
Smit AFA, Hubley R, Green P (2013–2015) RepeatMasker Open-4.0. Available from: http://www.repeatmasker.org.

Data Accessibility

The program, user documentation, tutorial, and example dataset are available on GitHub (https://github.com/campanam/BaitsTools).

Author Contributions

M.G.C. wrote the software and manuscript and performed all analyses.

Tables and figures

Table 1: BaitsTools subcommands, their functions, and input file requirements.
| Subcommand | Function | Input formats |
| --- | --- | --- |
| aln2baits | Generate variability-weighted baits from an alignment file | Alignment: FASTA/FASTQ |
| annot2baits | Generate baits from a genome annotation file and a reference sequence | Annotation: GTF/GFF; Reference: FASTA/FASTQ |
| bed2baits | Generate baits from a BED file and a reference sequence | Features: BED; Reference: FASTA/FASTQ |
| checkbaits | Evaluate and filter previously generated baits | Baits: FASTA/FASTQ |
| pyrad2baits | Select variants and generate baits from PyRAD and ipyrad loci files | Loci: LOCI |
| stacks2baits | Select variants and generate baits from a Stacks population summary statistics file and a reference sequence | Stacks: sumstats.TSV; Reference: FASTA/FASTQ |
| tilebaits | Generate baits from a list of sequences | Sequences: FASTA/FASTQ |
| vcf2baits | Select variants and generate baits from a VCF file and a reference sequence | Variants: VCF; Reference: FASTA/FASTQ |

Table 2: BaitsTools benchmarking experiments. The user, system, and wall-clock completion times are listed in seconds. Benchmarked file names are listed at the end of the experiment description (in parentheses).

| Subcommand | Experiment description | User | System | Wall-clock |
| --- | --- | --- | --- | --- |
| aln2baits | Weighted baits were generated and filtered from an alignment of five canid mitogenomes (canid_mito_aln.fa). | 0.250 | 0.016 | 0.276 |
| annot2baits | Baits were generated and filtered for all annotated genes and tRNAs from a Lycaon pictus mitogenome (Ananku.fa, Ananku.gff). | 0.052 | 0.011 | 0.065 |
| bed2baits | Weighted baits were generated and filtered from five 999-bp regions from an alignment of five canid mitogenomes (canid_mito_aln.fa, canid_mito_aln.bed). | 0.066 | 0.012 | 0.080 |
| checkbaits | Bait quality control was performed on the baits output from the aln2baits benchmarking experiment. | 0.0148 | 0.014 | 00.0167 |
| pyrad2baits | Baits were generated and filtered from two simulated ipyrad loci treated as sequence alignments (ipyrad.loci). | 0.054 | 0.086 | 0.017 |
| stacks2baits | Variants were sorted by population and deviation from Hardy-Weinberg Equilibrium (α = 0.025; options -H -A 0.025). Up to five variants (option -t 5) per category were selected. No baits were output (option -p) (example.sumstats.tsv). | 0.054 | 0.012 | 0.070 |
| tilebaits | Baits were generated and filtered from two Lycaon pictus mitogenomes and a FASTA file of canid pelage genes (lycaon_mito.fa, pelage_genes.fa). | 0.243 | 0.014 | 0.243 |
| vcf2baits | One-hundred Lycaon pictus X chromosome sequence variants were selected (option --m 100). Baits were generated and filtered from the selected variants using the Canis familiaris reference sequence (NC_006621.3) (WDF20_X.raw.vcf.gz). | 988.553 | 1.248 | 990.551 |

Figure 1: BaitsTools workflow. The entry points for each subcommand and the outputs from each BaitsTools step are listed. pyrad2baits is listed twice since it can treat input LOCI files either as variant-call files or sequence alignments.

Figure 2: BaitsTools interactive interface. Executing the baitstools.rb script without further arguments prints the splash screen detailing the available subcommands (top). Executing the script with a subcommand (but omitting other arguments) starts the interactive interface (bottom). Here the user has started the interactive prompts for the vcf2baits subcommand.

Figure 3: An example pipeline to generate highest-quality oligonucleotide bait sets. Reference sequences are masked with RepeatMasker to remove repetitive and low-complexity sequences. Candidate baits are generated from the masked reference sequences and filtered using BaitsTools. Filtered bait sequences are clustered using Cd-hit. Finally, bait sets are interrogated using BLAST searches for features such as inter-bait hybridization.

[Figure 1 image (men_12721_f1.pdf): workflow diagram. Variant selection (pyrad2baits, stacks2baits, vcf2baits) outputs filtered variants; bait generation (aln2baits, annot2baits, bed2baits, pyrad2baits, tilebaits) outputs candidate baits; quality control (checkbaits) outputs bait QC results; filtration outputs filtered baits.]

[Figure 2 image (men_12721_f2.tiff): screenshot of the interactive interface.]

[Figure 3 image (men_12721_f3.pdf): pipeline diagram. Input sequences, RepeatMasker, BaitsTools, Cd-hit, BLAST, output baits.]