Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications

Kumeren N. Govender; David W. Eyre

doi:10.1099/mgen.0.000886

Volume 8, Issue 10

Research Article

Open Access

Kumeren N. Govender¹ and David W. Eyre¹,²

Affiliations:
¹ Nuffield Department of Medicine, John Radcliffe Hospital, University of Oxford, Oxford, UK
² Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK
*Correspondence: Kumeren N. Govender, [email protected]
Published: 21 October 2022 https://doi.org/10.1099/mgen.0.000886

Abstract

Culture-independent metagenomic detection of microbial species has the potential to provide rapid and precise real-time diagnostic results. However, it is potentially limited by sequencing and taxonomic classification errors. We used simulated and real-world data to benchmark rates of species misclassification using 100 reference genomes for each of ten common bloodstream pathogens and six frequent blood-culture contaminants (n=1568; only 68 genomes were available for Micrococcus luteus). Simulating reads both with and without sequencing error for the Illumina and Oxford Nanopore platforms, we evaluated commonly used classification tools, including Kraken2, Bracken and Centrifuge, with mini (8 GB) and standard (30–50 GB) databases. Bracken with the standard database performed best: the median percentage of reads identified correctly to the species level across both sequencing platforms was 97.8% (IQR 92.7:99.0) [range 5:100]. For Kraken2 with a mini database, a commonly used combination, median species-level identification was 86.4% (IQR 50.5:93.7) [range 4.3:100]. Classification performance varied by species, with Escherichia coli being more challenging to classify correctly (probability of reads being assigned to the correct species: 56.1–96.0%, varying by tool used). Human read misclassification was negligible. After filtering out shorter Nanopore reads, performance was similar or superior to that of Illumina sequencing, despite higher sequencing error rates. Misclassification was more common when the misclassified species had a higher average nucleotide identity to the true species. Our findings highlight that taxonomic misclassification of sequencing data occurs and varies by sequencing and analysis workflow. To account for ‘bioinformatic contamination’, we present a contamination catalogue that can be used in metagenomic pipelines to ensure accurate results that can support clinical decision making.
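The headline metric in the abstract, the percentage of reads assigned to the correct species, summarised as a median with IQR and range across genomes, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the toy read assignments, and the choice of the "inclusive" quantile method are all assumptions made for illustration only.

```python
from statistics import median, quantiles


def percent_correct(assigned, true_species):
    """Percentage of reads whose assigned species equals the true species."""
    if not assigned:
        raise ValueError("no reads to evaluate")
    hits = sum(1 for species in assigned if species == true_species)
    return 100.0 * hits / len(assigned)


def summarise(percentages):
    """Median, quartiles, and range of per-genome percentages.

    The quantile method is an assumption; the abstract does not state
    which one was used.
    """
    q1, _, q3 = quantiles(percentages, n=4, method="inclusive")
    return {
        "median": median(percentages),
        "q1": q1,
        "q3": q3,
        "min": min(percentages),
        "max": max(percentages),
    }


# Hypothetical toy data: per-read species calls for three simulated genomes
# whose true species is Escherichia coli. The misassigned species are
# illustrative placeholders, not results from the study.
genomes = {
    "genome_A": ["Escherichia coli"] * 9 + ["Shigella flexneri"],
    "genome_B": ["Escherichia coli"] * 6 + ["Shigella sonnei"] * 4,
    "genome_C": ["Escherichia coli"] * 10,
}

per_genome = [percent_correct(reads, "Escherichia coli") for reads in genomes.values()]
summary = summarise(per_genome)
print(per_genome)
print(summary)
```

In practice the per-read assignments would come from a classifier's output rather than hand-written lists, and the summary would be computed separately for each tool, database, and sequencing platform.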

Received: 14/01/2022
Accepted: 16/08/2022
Published Online: 21/10/2022

Keyword(s): diagnosis, infection, metagenomics, simulation, taxonomic classification and whole-genome sequencing

Funding

This study was supported by the:

NIHR Oxford Biomedical Research Centre
- Principal Award Recipient: Not applicable
Rhodes Scholarships
- Principal Award Recipient: Kumeren Nadaraj Govender

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.

M Gen 8, 000886 (2022); https://doi.org/10.1099/mgen.0.000886
