
Abstract

Metagenomics has become a prominent technology to study the functional potential of all organisms in a microbial community. Most studies focus on the bacterial content of these communities and ignore eukaryotic microbes. Indeed, many metagenomics analysis pipelines silently assume that all contigs in a metagenome are prokaryotic, which likely results in less accurate annotation of eukaryotes in metagenomes. Early detection of eukaryotic contigs allows for eukaryote-specific gene prediction and functional annotation. Here, we developed a classifier that distinguishes eukaryotic from prokaryotic contigs based on fundamental differences in gene structure between these taxa. We first developed Whokaryote, a random forest classifier that uses intergenic distance, gene density and gene length as the most important features. We show that, with an estimated recall, precision and accuracy of 94, 96 and 95 %, respectively, this classifier with features grounded in biology can perform almost as well as the classifiers EukRep and Tiara, which use k-mer frequencies as features. By retraining our classifier with Tiara predictions as an additional feature, the weaknesses of both types of classifiers are compensated for; the result is Whokaryote+Tiara, an enhanced classifier that outperforms all individual classifiers, with an F1 score of 0.99 for both eukaryotes and prokaryotes, while still being fast. In a reanalysis of metagenome data from a disease-suppressive plant endospheric microbial community, we show how using Whokaryote+Tiara to select contigs for eukaryotic gene prediction facilitates the discovery of several biosynthetic gene clusters that were missed in the original study. Whokaryote (+Tiara) is wrapped in an easily installable package and is freely available from https://github.com/LottePronk/whokaryote.
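To make the gene-structure features named in the abstract concrete, the following is a minimal sketch, not Whokaryote's actual implementation. It assumes gene coordinates have already been predicted for a contig, and it defines gene density as genes per kilobase, which is an illustrative choice rather than the definition used by the tool. The names `Gene`, `contig_features` and `classification_metrics` are hypothetical. The second helper shows how recall, precision, accuracy and F1 follow from confusion-matrix counts.

```python
from __future__ import annotations

from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Gene:
    """Hypothetical container for one predicted gene on a contig."""
    start: int  # 1-based inclusive start coordinate
    end: int    # 1-based inclusive end coordinate


def contig_features(contig_length: int, genes: list[Gene]) -> dict[str, float]:
    """Summarise gene structure on one contig (illustrative feature set)."""
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    ordered = sorted(genes, key=lambda g: g.start)
    lengths = [g.end - g.start + 1 for g in ordered]
    # Distance between consecutive genes; overlapping genes count as 0.
    gaps = [max(0, nxt.start - cur.end - 1)
            for cur, nxt in zip(ordered, ordered[1:])]
    return {
        # Assumption: density expressed as genes per 1000 bp of contig.
        "gene_density": len(ordered) / contig_length * 1000,
        "mean_gene_length": mean(lengths) if lengths else 0.0,
        "mean_intergenic_distance": mean(gaps) if gaps else 0.0,
    }


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, accuracy and F1 from confusion counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


if __name__ == "__main__":
    # Toy contig: three 1000 bp genes on a 10 kb contig.
    toy_genes = [Gene(1, 1000), Gene(1101, 2100), Gene(2301, 3300)]
    print(contig_features(10_000, toy_genes))
    print(classification_metrics(tp=90, fp=10, tn=95, fn=5))
```

Feature dictionaries like these, computed per contig, could then be passed to a random forest implementation such as scikit-learn's `RandomForestClassifier`. The intuition is that prokaryotic contigs tend to have densely packed genes with short intergenic gaps, whereas eukaryotic contigs tend to be sparser.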

Funding
This study was supported by the:
  • Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Award OCENW.GROOT.2019.063)
    • Principal Award Recipient: Marnix H. Medema

This is an open-access article distributed under the terms of the Creative Commons Attribution License.
/content/journal/mgen/10.1099/mgen.0.000823
2022-05-03
2022-05-24


