Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel Open Access

Abstract

Hierarchical genotyping approaches can provide insights into the source, geography and temporal distribution of bacterial pathogens. Multiple hierarchical SNP genotyping schemes have previously been developed so that new isolates can rapidly be placed within pre-computed population structures, without the need to rebuild phylogenetic trees for the entire dataset. This classification approach has, however, seen limited uptake in routine public health settings because of its analytical complexity and the lack of standardized tools that produce results that are clear and easy to interpret. BioHansel was developed as an organism-agnostic tool for hierarchical SNP-based genotyping. Using SNP-based genotyping schemes, it identifies split k-mers that distinguish predefined lineages in whole genome sequencing (WGS) data. BioHansel uses the Aho-Corasick algorithm to type isolates from assembled genomes or raw read sequence data in a matter of seconds, with limited computational resources. This makes BioHansel well suited to public health agencies that rely on WGS methods for surveillance of bacterial pathogens. Genotyping results are evaluated by a quality assurance module that identifies problematic samples, such as low-quality or contaminated datasets. Using existing hierarchical SNP schemes for Mycobacterium tuberculosis and Salmonella Typhi, we compare the genotyping results obtained with the k-mer-based tools BioHansel and SKA with those of the organism-specific tools TBProfiler and genotyphi, which use gold-standard reference-mapping approaches. We show that the genotyping results are fully concordant across these different methods, and that the k-mer-based tools are significantly faster. We also test the ability of the BioHansel quality assurance module to detect intra-lineage contamination and demonstrate that it is effective, even in populations with low genetic diversity. We demonstrate the scalability of the tool using a dataset of ~8100 Salmonella Typhi public genomes and provide the aggregated results of geographical distributions as part of the tool’s output. BioHansel is an open source Python 3 application available on PyPI and Conda repositories and as a Galaxy tool from the public Galaxy Toolshed. In a public health context, BioHansel enables rapid and high-resolution classification of bacterial pathogens with low genetic diversity.
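
To make the matching step described in the abstract more concrete, the following Python sketch shows how an Aho-Corasick automaton can search a sequence for many lineage-defining k-mers in a single pass and then derive a hierarchical genotype. It is an illustration under stated assumptions, not BioHansel's implementation: the 7-mers, the lineage labels in SCHEME and the function names are all hypothetical, and reverse complements and the quality assurance checks are left out.

```python
from collections import deque

# Hypothetical toy scheme: each k-mer carries a lineage-defining SNP and maps
# to a dot-separated hierarchical genotype. Real schemes use longer k-mers.
SCHEME = {
    "ACGTACG": "1",
    "TTGACCA": "1.2",
    "GGATCCT": "1.2.3",
    "CCATGGA": "2",
}

def build_automaton(patterns):
    """Build an Aho-Corasick automaton: trie edges, failure links, outputs."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure target.
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out

def find_matches(text, automaton):
    """Return the set of patterns found in text, scanning it once."""
    goto, fail, out = automaton
    state, found = 0, set()
    for ch in text:
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        found.update(out[state])
    return found

def call_genotype(sequence, scheme):
    """Return the deepest lineages whose ancestors are all supported by a hit."""
    hits = {scheme[k] for k in find_matches(sequence, build_automaton(scheme))}
    consistent = [
        g for g in hits
        if all(".".join(g.split(".")[:i]) in hits
               for i in range(1, g.count(".") + 1))
    ]
    if not consistent:
        return []
    depth = max(g.count(".") for g in consistent)
    return sorted(g for g in consistent if g.count(".") == depth)

print(call_genotype("NNACGTACGNNTTGACCANN", SCHEME))  # ['1.2']
```

A single-element result is an unambiguous call. An empty list, or several lineages at the same depth, is the kind of outcome a quality assurance step would need to examine further.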

Funding
This study was supported by the:
  • Genomics Research and Development Initiative (GRDI) of the Government of Canada (Award Grants ID 2256344 and 2267905)
    • Principal Award Recipient: Roger P. Johnson

/content/journal/mgen/10.1099/mgen.0.000651
2021-09-23
2024-03-29
