
Abstract

Comparative analysis of whole-genome sequencing (WGS) data enables fine-scale investigation of transmission and is increasingly part of routine surveillance. However, these analyses are constrained by the computational requirements of the large volumes of data involved. By decomposing WGS reads or assemblies into k-mers and applying the dimensionality reduction technique MinHash, it is possible to rapidly approximate genomic distances without alignment. Here we assessed the performance of MinHash, as implemented by sourmash, in predicting single nucleotide differences between genomes (SNPs) and ribotypes (RTs). For a set of 1905 diverse genomes (differing by 0–168 519 SNPs), screening for closely related genomes with sourmash at a sensitivity of 100 % for pairs ≤10 SNPs reduced the number of pairs from 1 813 560 overall to 161 934, i.e. by 91 %, with a positive predictive value (PPV) of 32 % for correctly identifying pairs ≤10 SNPs (maximum SNP distance 4144). At a sensitivity of 95 %, pairs were reduced by 94 % to 108 266 and the PPV increased to 45 % (maximum SNP distance 1009). Increasing the MinHash sketch size above 2000 produced minimal performance improvement. We also explored a MinHash similarity-based ribotype prediction method. Genomes with known ribotypes (n=3937) were randomly split into a training set (2937) and a test set (1000). The training set was used to construct a sourmash index against which genomes from the test set were compared. If the closest five genomes in the index had the same ribotype, this was taken to predict the searched genome’s ribotype. Using our MinHash ribotype index, predicted ribotypes were correct in 780/1000 (78 %) genomes, incorrect in 20 (2 %), and indeterminate in 200 (20 %). Relaxing the classifier to 4/5 closest matches with the same RT improved the correct predictions to 87 %.
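The ribotype decision rule described above can be sketched in a few lines of Python. This is only an illustration of the voting step, not the authors' code: the function name `predict_ribotype`, its parameters, and the example ribotype labels are assumptions, and the ranked list of neighbour ribotypes is assumed to come from a similarity search against an index.

```python
from collections import Counter


def predict_ribotype(neighbour_ribotypes, min_agree=5, n_neighbours=5):
    """Return a predicted ribotype, or None when the call is indeterminate.

    neighbour_ribotypes: ribotypes of the closest index matches, most
    similar first (hypothetical input, e.g. taken from a search result).
    min_agree: how many of the n_neighbours closest matches must share
    one ribotype (5 for the strict rule, 4 for the relaxed rule).
    """
    top = neighbour_ribotypes[:n_neighbours]
    if len(top) < n_neighbours:
        return None  # too few neighbours to apply the rule
    ribotype, count = Counter(top).most_common(1)[0]
    return ribotype if count >= min_agree else None


# Illustrative neighbour list: four matches of one ribotype, one of another.
neighbours = ["027", "027", "027", "027", "078"]
strict_call = predict_ribotype(neighbours)                # 5/5 rule
relaxed_call = predict_ribotype(neighbours, min_agree=4)  # 4/5 rule
```

Under the strict rule this example is indeterminate, while the relaxed rule returns the majority ribotype, which mirrors the trade-off between indeterminate and resolved calls reported above.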
Using MinHash it is possible to subsample genome k-mer hashes and use them to approximate small genomic differences within minutes, significantly reducing the search space for further analysis.
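To make the subsampling idea concrete, the following is a minimal, self-contained sketch of a bottom-k MinHash in plain Python. It is not sourmash's implementation: the hash function (BLAKE2b), the k-mer length of 21, the use of forward-strand k-mers only, and the simulated sequences are all assumptions chosen for illustration. Only the sketch size of 2000 is taken from the text above.

```python
import hashlib
import random


def kmer_hashes(seq, k=21):
    """Hash every k-mer of seq (forward strand only) to a 64-bit integer."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, size=2000):
    """Bottom-k sketch: keep only the `size` smallest k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size=2000):
    """Estimate Jaccard similarity from two bottom-k sketches.

    Take the `size` smallest hashes of the merged sketches and count
    the fraction that occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


# Simulated example (assumption): a random 50 kb sequence and a copy
# that differs from it by 10 single-nucleotide substitutions.
rng = random.Random(0)
reference = "".join(rng.choices("ACGT", k=50_000))
bases = list(reference)
for pos in rng.sample(range(len(reference)), 10):
    bases[pos] = rng.choice([b for b in "ACGT" if b != reference[pos]])
variant = "".join(bases)

# Exact Jaccard over all k-mer hashes versus the sketch-based estimate.
all_ref, all_var = kmer_hashes(reference), kmer_hashes(variant)
exact = len(all_ref & all_var) / len(all_ref | all_var)
estimated = jaccard_estimate(sketch(reference), sketch(variant))
```

Each substitution changes up to k overlapping k-mers, so a handful of SNPs lowers the Jaccard similarity only slightly below 1. The estimate is computed from 2000 hashes per genome rather than from every k-mer, which is why pairwise comparison across many genomes stays cheap.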

Funding
This study was supported by the:
  • NIHR
    • Principal Award Recipient: Not applicable

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
DOI: 10.1099/mgen.0.000804
2022-04-06
2024-12-06
Supplements
  • Supplementary material 1 (PDF)
  • Supplementary material 2 (EXCEL)
  • Supplementary material 3 (EXCEL)
  • Supplementary material 4 (EXCEL)
  • Supplementary material 5 (EXCEL)
  • Supplementary material 6 (EXCEL)
  • Supplementary material 7 (EXCEL)