Abstract

Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes affect different types of analyses to different degrees. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to newly sequenced genomes, we emphasize that the continuous updating of popular public databases is an urgent and unresolved problem. Given the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

/content/journal/micro/10.1099/mic.0.033811-0
2010-07-01
2019-12-13
Supplements

vol. 156, part 7, pp. 1909–1917

Details of the 30 strains used for the test study, and the results of the search for missing orthologues in these strains [Excel file] (987 kb)


