Evaluation of methods for detecting human reads in microbial sequencing datasets

Stephen J. Bush; Thomas R. Connor; Tim E.A. Peto; Derrick W. Crook; A. Sarah Walker

doi:10.1099/mgen.0.000393

Volume 6, Issue 7

Research Article

Open Access

Stephen J. Bush¹, Thomas R. Connor²,³, Tim E.A. Peto¹,⁴,⁵, Derrick W. Crook¹,⁴,⁵ and A. Sarah Walker¹,⁴,⁵

Affiliations: ¹ Nuffield Department of Medicine, University of Oxford, Oxford, UK; ² Organisms and Environment Division, School of Biosciences, Cardiff University, Cardiff, Wales, UK; ³ Public Health Wales, University Hospital of Wales, Cardiff, UK; ⁴ National Institute for Health Research Health Protection Research Unit in Healthcare Associated Infections and Antimicrobial Resistance at University of Oxford in partnership with Public Health England, Oxford, UK; ⁵ National Institute for Health Research Oxford Biomedical Research Centre, Oxford, UK
*Correspondence: Stephen J. Bush, [email protected]
Published: 19 June 2020 https://doi.org/10.1099/mgen.0.000393

Abstract

Sequencing data from host-associated microbes can often be contaminated by DNA from the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome, with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
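The two-stage idea and the way methods are compared on simulated data can be illustrated with a minimal sketch. It is not the authors' pipeline: the read IDs, the `two_stage_calls` and `score` helpers, and the toy second-stage detector are all hypothetical, and a real workflow would derive the sets of flagged reads from the output of the aligners or classifiers themselves. The sketch only shows the logic of passing reads left unflagged by a first method to a second one, then scoring the combined calls against the known origin of each simulated read.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Set


@dataclass(frozen=True)
class Metrics:
    """Counts of human-read calls judged against the simulated truth."""
    true_pos: int   # human reads called human
    false_pos: int  # microbial reads called human
    false_neg: int  # human reads not called human

    @property
    def sensitivity(self) -> float:
        denom = self.true_pos + self.false_neg
        return self.true_pos / denom if denom else 0.0

    @property
    def precision(self) -> float:
        denom = self.true_pos + self.false_pos
        return self.true_pos / denom if denom else 0.0


def two_stage_calls(
    all_reads: Iterable[str],
    stage1_human: Iterable[str],
    stage2_detector: Callable[[Set[str]], Set[str]],
) -> Set[str]:
    """Return reads flagged by stage 1 plus those stage 2 flags in the remainder."""
    stage1 = set(stage1_human)
    remaining = set(all_reads) - stage1  # only unflagged reads reach stage 2
    return stage1 | set(stage2_detector(remaining))


def score(called_human: Iterable[str], truly_human: Iterable[str]) -> Metrics:
    """Compare a set of human-read calls with the known human reads."""
    called, truth = set(called_human), set(truly_human)
    return Metrics(len(called & truth), len(called - truth), len(truth - called))


# Toy data (hypothetical read IDs): four human and four bacterial reads.
truly_human = {"h1", "h2", "h3", "h4"}
all_reads = truly_human | {"b1", "b2", "b3", "b4"}
stage1_human = {"h1", "h2", "h3"}  # stand-in for the first method's calls


def toy_stage2(reads: Set[str]) -> Set[str]:
    # Stand-in for a second method that recovers the one missed human read.
    return {r for r in reads if r == "h4"}


single = score(stage1_human, truly_human)
combined = score(two_stage_calls(all_reads, stage1_human, toy_stage2), truly_human)
print(f"stage 1 only: sensitivity={single.sensitivity:.2f}, precision={single.precision:.2f}")
print(f"two-stage:    sensitivity={combined.sensitivity:.2f}, precision={combined.precision:.2f}")
```

Taking the union of the two stages can only raise sensitivity; whether precision holds up depends on how few microbial reads each method misidentifies as human, which is the property the benchmark above measures.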

Received: 17/12/2019
Accepted: 25/05/2020
Published Online: 19/06/2020

Keyword(s): contamination, human, read depletion and read removal

Funding

This study was supported by the:

National Institute for Health Research Health Protection Research Unit (Award HPRU-2012-10041)
- Principal Award Recipient: Not Applicable

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.




M Gen 6, e000393 (2020); https://doi.org/10.1099/mgen.0.000393
