
Abstract

The Genome Taxonomy Database (GTDB), which has been widely adopted by the scientific community, provides a species-to-domain classification of publicly available genomes based on average nucleotide identity (ANI) for species and on a concatenated gene phylogeny normalized by evolutionary rates for genus to phylum. Here, we use the Genome UNClutterer (GUNC) software to identify putatively contaminated genomes in GTDB release 07-RS207. GUNC reported 35,723 genomes as putatively contaminated, comprising 11.25 % of the 317,542 genomes in the release. To assess the impact of this high level of inferred contamination on the delineation of taxa, we created ‘clean’ versions of the 34,846 putatively contaminated bacterial genomes by removing the most contaminated half of each genome. For each clean half, we recalculated the ANI and concatenated gene phylogeny and found that only 77 (0.22 %) of the genomes were not consistent with their original classification. We conclude that the delineation of taxa in GTDB is robust to the putative contamination detected by GUNC.
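The two proportions quoted above can be checked directly from the reported counts. The following is a minimal arithmetic sketch, not part of the study's pipeline; the variable names and the `percent` helper are illustrative assumptions, and only the four counts come from the abstract.

```python
# Counts as reported in the abstract for GTDB release 07-RS207.
total_genomes = 317_542              # all genomes in the release
flagged_by_gunc = 35_723             # genomes GUNC reported as putatively contaminated
flagged_bacterial = 34_846           # flagged bacterial genomes that were 'cleaned'
inconsistent_after_cleaning = 77     # clean halves not matching the original classification


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimal places (hypothetical helper)."""
    return round(100 * part / whole, 2)


flagged_pct = percent(flagged_by_gunc, total_genomes)
inconsistent_pct = percent(inconsistent_after_cleaning, flagged_bacterial)

print(f"Putatively contaminated: {flagged_pct:.2f} % of all genomes")
print(f"Inconsistent after cleaning: {inconsistent_pct:.2f} % of cleaned bacterial genomes")
```

Note that the two percentages use different denominators: the first is relative to all genomes in the release, the second to the flagged bacterial genomes only.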

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001256
2024-05-29
2024-06-19

Supplements

Supplementary material 1

EXCEL