Abstract

Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes affect different types of analyses to different degrees. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.

/content/journal/micro/10.1099/mic.0.033811-0
2010-07-01
2020-10-23