
Abstract

Sequencing data from host-associated microbes can often be contaminated by the body of the investigator or research subject. Human DNA is typically removed from microbial reads either by subtractive alignment (dropping all reads that map to the human genome) or by using a read classification tool to predict those of human origin, and then discarding them. To inform best practice guidelines, we benchmarked eight alignment-based and two classification-based methods of human read detection using simulated data from 10 clinically prevalent bacteria and three viruses, into which contaminating human reads had been added. While the majority of methods successfully detected >99 % of the human reads, they were distinguishable by variance. The most precise methods, with negligible variance, were Bowtie2 and SNAP, both of which misidentified few, if any, bacterial reads (and no viral reads) as human. While correctly detecting a similar number of human reads, methods based on taxonomic classification, such as Kraken2 and Centrifuge, could misclassify bacterial reads as human, although the extent of this was species-specific. Among the most sensitive methods of human read detection was BWA, although this also made the greatest number of false positive classifications. Across all methods, the set of human reads not identified as such, although often representing <0.1 % of the total reads, were non-randomly distributed along the human genome with many originating from the repeat-rich sex chromosomes. For viral reads and longer (>300 bp) bacterial reads, the highest performing approaches were classification-based, using Kraken2 or Centrifuge. For shorter (c. 150 bp) bacterial reads, combining multiple methods of human read detection maximized the recovery of human reads from contaminated short read datasets without being compromised by false positives. 
A particularly high-performance approach with shorter bacterial reads was a two-stage classification using Bowtie2 followed by SNAP. Using this approach, we re-examined 11 577 publicly archived bacterial read sets for hitherto undetected human contamination. We were able to extract a sufficient number of reads to call known human SNPs, including those with clinical significance, in 6 % of the samples. These results show that phenotypically distinct human sequence is detectable in publicly archived microbial read datasets.
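The two-stage idea described above can be sketched in a few lines. This is an illustrative outline only, not the pipeline used in the study: the two detectors are hypothetical stand-ins (any callable that returns the read IDs it considers human), and the function names `two_stage_human_filter` and `sensitivity_and_precision` are assumptions introduced here. In practice each stage would wrap a run of an aligner such as Bowtie2 or SNAP.

```python
from typing import Callable, Dict, Set, Tuple

# A detector takes {read_id: sequence} and returns the IDs it calls human.
# Both detectors are hypothetical stand-ins for real aligner runs.
Detector = Callable[[Dict[str, str]], Set[str]]


def two_stage_human_filter(
    reads: Dict[str, str], first: Detector, second: Detector
) -> Set[str]:
    """Run `first`, then run `second` only on the reads `first` did not flag.

    Returns the union of read IDs flagged as human by either stage.
    """
    flagged_first = first(reads)
    remaining = {rid: seq for rid, seq in reads.items() if rid not in flagged_first}
    flagged_second = second(remaining)
    return flagged_first | flagged_second


def sensitivity_and_precision(
    flagged: Set[str], truly_human: Set[str]
) -> Tuple[float, float]:
    """Score a set of flagged reads against known (simulated) human reads."""
    tp = len(flagged & truly_human)   # human reads correctly flagged
    fp = len(flagged - truly_human)   # microbial reads wrongly flagged
    fn = len(truly_human - flagged)   # human reads that were missed
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return sensitivity, precision


if __name__ == "__main__":
    # Toy data: two simulated human reads and two bacterial reads.
    reads = {"h1": "ACGT", "h2": "TTAGGG", "b1": "GGCC", "b2": "CCGG"}
    truth = {"h1", "h2"}
    stage_one = lambda r: {rid for rid in r if rid == "h1"}
    stage_two = lambda r: {rid for rid in r if rid == "h2"}
    flagged = two_stage_human_filter(reads, stage_one, stage_two)
    print(sensitivity_and_precision(flagged, truth))
```

Running the second detector only on reads the first one retained keeps the stages sequential, so any read flagged at either stage is discarded. Scoring against the known origin of each simulated read is what separates sensitivity (human reads recovered) from precision (microbial reads not lost to false positives).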

Funding
This study was supported by the:
  • National Institute for Health Research Health Protection Research Unit (Award HPRU-2012-10041)

/content/journal/mgen/10.1099/mgen.0.000393
2020-06-19
2020-08-09
