
Abstract

Accurate orthologue identification is a vital component of bacterial comparative genomic studies, but many popular sequence-similarity-based approaches do not scale well to the large numbers of genomes that are now generated routinely. Furthermore, most approaches do not take gene synteny into account, even though it provides useful information for disentangling paralogues. Here, we present SynerClust, a user-friendly, synteny-aware tool based on Synergy that can process thousands of genomes. SynerClust was designed to analyse genomes with high levels of local synteny, particularly prokaryotes, which have operon structure. SynerClust’s run-time is optimized by selecting cluster representatives at each node in the phylogeny, thus avoiding the need for exhaustive pairwise similarity searches. In benchmarking against Roary, Hieranoid2, PanX and Reciprocal Best Hit, SynerClust identified sets of core genes more completely for datasets that included diverse strains, while using substantially less memory and scaling comparably to the fastest tools. Owing to its scalability, ease of installation and use, and suitability for a variety of computing environments, orthogroup clustering using SynerClust will enable many large-scale prokaryotic comparative genomics efforts.
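
The scaling argument can be illustrated with a toy count. This is only a sketch and not SynerClust's code. It assumes a strictly binary guide tree written as nested tuples, and one similarity search per internal node between the representatives of its two children. The function names and the eight-genome tree are hypothetical.

```python
from itertools import combinations


def leaves(tree):
    """Return the leaf (genome) names of a nested-tuple binary tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def tree_guided_comparisons(tree):
    """Count internal nodes, assuming one representative-vs-representative
    search per node (an assumption made for this illustration)."""
    if isinstance(tree, str):
        return 0
    left, right = tree
    return 1 + tree_guided_comparisons(left) + tree_guided_comparisons(right)


def all_vs_all_comparisons(tree):
    """Count the genome pairs an exhaustive pairwise search would visit."""
    return sum(1 for _ in combinations(leaves(tree), 2))


# Hypothetical guide tree over eight genomes labelled A to H.
toy_tree = ((("A", "B"), ("C", "D")), (("E", "F"), ("G", "H")))

print("all-vs-all genome pairs:", all_vs_all_comparisons(toy_tree))
print("tree-guided node comparisons:", tree_guided_comparisons(toy_tree))
```

Under these assumptions, the exhaustive count grows quadratically with the number of genomes (N(N-1)/2), whereas the tree-guided count grows linearly (N-1). The cost of each individual comparison is not modelled here.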

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
/content/journal/mgen/10.1099/mgen.0.000231
2018-11-12
2024-03-28