An assessment of genome annotation coverage across the bacterial tree of life Open Access

Abstract

Although gene-finding in bacterial genomes is relatively straightforward, the automated assignment of gene function is still challenging, resulting in a vast quantity of hypothetical sequences of unknown function. But how prevalent are hypothetical sequences across bacteria, what proportion of genes in different bacterial genomes remain unannotated, and what factors affect annotation completeness? To address these questions, we surveyed over 27 000 bacterial genomes from the Genome Taxonomy Database, and measured genome annotation completeness as a function of annotation method, taxonomy, genome size, 'research bias' and publication date. Our analysis revealed that 52 and 79 % of the average bacterial proteome could be functionally annotated based on protein and domain-based homology searches, respectively. Annotation coverage using protein homology search varied significantly from as low as 14 % in some species to as high as 98 % in others. We found that taxonomy is a major factor influencing annotation completeness, with distinct trends observed across the microbial tree (e.g. the lowest level of completeness was found in the lineage). Most lineages showed a significant association between genome size and annotation incompleteness, likely reflecting a greater degree of uncharacterized sequences in 'accessory' proteomes than in 'core' proteomes. Finally, research bias, as measured by publication volume, was also an important factor influencing genome annotation completeness, with early model organisms showing high completeness levels relative to other genomes in their own taxonomic lineages. Our work highlights the disparity in annotation coverage across the bacterial tree of life and emphasizes a need for more experimental characterization of accessory proteomes as well as understudied lineages.
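
As an illustration of the coverage measure described above, the following is a minimal sketch and not the pipeline used in this study. It assumes that annotation completeness for a genome is the percentage of its predicted proteins with at least one functional annotation, and that the average is taken across genomes. The function name, the genome labels and the protein identifiers are hypothetical.

```python
from statistics import mean


def annotation_coverage(protein_ids, annotated_ids):
    """Percentage of a genome's predicted proteins with at least one annotation.

    Both arguments are iterables of protein identifiers (hypothetical inputs).
    """
    proteins = set(protein_ids)
    if not proteins:
        raise ValueError("genome has no predicted proteins")
    # Only count annotated identifiers that belong to this genome.
    return 100.0 * len(proteins & set(annotated_ids)) / len(proteins)


# Hypothetical toy data: genome -> (all predicted proteins, proteins with a hit)
genomes = {
    "genome_A": (["p1", "p2", "p3", "p4"], ["p1", "p2", "p3"]),
    "genome_B": (["q1", "q2", "q3", "q4"], ["q1"]),
}

coverage = {
    name: annotation_coverage(all_proteins, hits)
    for name, (all_proteins, hits) in genomes.items()
}
average_coverage = mean(coverage.values())
```

Computing the percentage per genome before averaging gives each genome equal weight regardless of its size, which is one plausible reading of "the average bacterial proteome"; pooling all proteins first would instead weight larger genomes more heavily.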

Funding
This study was supported by the:
  • Natural Sciences and Engineering Research Council of Canada, http://dx.doi.org/10.13039/501100000038 (Award RGPIN-2019-04266)
/content/journal/mgen/10.1099/mgen.0.000341
2020-03-03
2024-03-29
