
Abstract

Phylogenetic analyses are widely used in microbiological research, for example to trace the progression of bacterial outbreaks based on whole-genome sequencing data. In practice, multiple analysis steps such as assembly, alignment and phylogenetic inference are combined to form phylogenetic workflows. Comprehensive benchmarking of the accuracy of complete phylogenetic workflows is lacking. To benchmark different phylogenetic workflows, we simulated bacterial evolution under a wide range of evolutionary models, varying the relative rates of substitution, insertion, deletion, gene duplication, gene loss and lateral gene transfer events. The generated datasets corresponded to the genetic diversity usually observed within bacterial species (≥95 % average nucleotide identity). We replicated each simulation three times to assess replicability. In total, we benchmarked 19 distinct phylogenetic workflows using 8 different simulated datasets. We found that recently developed k-mer alignment methods such as kSNP and SKA achieve accuracy similar to that of reference mapping. The high accuracy of k-mer alignment methods can be explained by the large fractions of genomes these methods can align, relative to other approaches. We also found that the choice of assembly algorithm influences the accuracy of phylogenetic reconstruction, with workflows employing SPAdes or SKESA outperforming those employing Velvet. Finally, we found that the results of phylogenetic benchmarking are highly variable between replicates. We conclude that for phylogenomic reconstruction, k-mer alignment methods are relevant alternatives to reference mapping at the species level, especially in the absence of suitable reference genomes. We show genome assembly accuracy to be an underappreciated parameter required for accurate phylogenomic reconstruction.
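
To make the notion of "accuracy of phylogenetic reconstruction" concrete, the following is a minimal sketch of how an inferred tree might be scored against the known simulated tree. The abstract does not state which tree-distance measure was used; the Robinson-Foulds distance shown here is an assumption chosen because it is a common topological measure. The nested-tuple tree representation and the function names (`leaves`, `splits`, `robinson_foulds`) are hypothetical and for illustration only.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical representation: a leaf is a string, an internal node is a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def leaves(tree: Tree) -> FrozenSet[str]:
    """Return the set of leaf names below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def splits(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the non-trivial bipartitions of the tree, treated as unrooted.

    Each bipartition is stored as the side that does NOT contain an anchor
    leaf, so the same split is always represented by the same set.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    found: Set[FrozenSet[str]] = set()

    def visit(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Skip trivial splits that separate nothing or a single leaf.
        if 2 <= len(side) <= len(all_leaves) - 2:
            found.add(side)
        return below

    visit(tree)
    return found


def robinson_foulds(a: Tree, b: Tree) -> int:
    """Count the bipartitions present in exactly one of the two trees."""
    if leaves(a) != leaves(b):
        raise ValueError("trees must share the same leaf set")
    return len(splits(a) ^ splits(b))


true_tree = (("A", "B"), ("C", "D"))
same_topology = (("C", "D"), ("B", "A"))
wrong_topology = (("A", "C"), ("B", "D"))

print(robinson_foulds(true_tree, same_topology))   # 0
print(robinson_foulds(true_tree, wrong_topology))  # 2
```

A distance of 0 means the inferred topology matches the simulated one; larger values indicate more conflicting splits. In a benchmark of the kind described, such a score would be computed per workflow and per replicate.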

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License.

/content/journal/mgen/10.1099/mgen.0.000799
2022-03-15
2024-04-19
