Using comparative genome analysis to identify problems in annotated microbial genomes

Maria S. Poptsova; J. Peter Gogarten

doi:10.1099/mic.0.033811-0

Volume 156, Issue 7

Maria S. Poptsova¹ and J. Peter Gogarten¹

Affiliations: ¹ Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT 06269-3125, USA
Correspondence: Maria S. Poptsova [email protected]
Published: 01 July 2010 https://doi.org/10.1099/mic.0.033811-0

Abstract

Genome annotation is a tedious task that is mostly done by automated methods; however, the accuracy of these approaches has been questioned since the beginning of the sequencing era. Genome annotation is a multilevel process, and errors can emerge at different stages: during sequencing, as a result of gene-calling procedures, and in the process of assigning gene functions. Missed or wrongly annotated genes differentially impact different types of analyses. Here we discuss and demonstrate how the methods of comparative genome analysis can refine annotations by locating missing orthologues. We also discuss possible reasons for errors and show that the second-generation annotation systems, which combine multiple gene-calling programs with similarity-based methods, perform much better than the first annotation tools. Since old errors may propagate to the newly sequenced genomes, we emphasize that the problem of continuously updating popular public databases is an urgent and unresolved one. Due to the progress in genome-sequencing technologies, automated annotation techniques will remain the main approach in the future. Researchers need to be aware of the existing errors in the annotation of even well-studied genomes, such as Escherichia coli, and consider additional quality control for their results.
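The idea of locating missing orthologues by comparing genomes can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not describe the procedure, so the sketch assumes that gene families have already been clustered across a set of related genomes, and it simply flags families that are present in most genomes but absent from the annotation of a few. The function name, the `min_fraction` threshold and the strain and family labels are all hypothetical.

```python
from collections import defaultdict


def find_candidate_missing_orthologues(families, genomes, min_fraction=0.75):
    """Flag gene families that are annotated in most genomes but not in all.

    families     -- dict mapping a family id to the set of genome names in
                    which at least one member of the family is annotated
                    (assumed to come from a prior clustering step)
    genomes      -- list of all genome names under comparison
    min_fraction -- hypothetical cut-off: a family must be annotated in at
                    least this fraction of genomes to be considered

    Returns a dict mapping each genome to the sorted list of family ids
    that are candidates for a missed gene call in that genome.
    """
    genome_set = set(genomes)
    total = len(genomes)
    candidates = defaultdict(list)
    for family_id in sorted(families):
        present = set(families[family_id]) & genome_set
        if len(present) == total:
            continue  # annotated everywhere, nothing to flag
        if len(present) / total >= min_fraction:
            for genome in genomes:
                if genome not in present:
                    candidates[genome].append(family_id)
    return dict(candidates)


# Toy input: five hypothetical strains and four hypothetical families.
genomes = ["strain_A", "strain_B", "strain_C", "strain_D", "strain_E"]
families = {
    "fam1": {"strain_A", "strain_B", "strain_C", "strain_D", "strain_E"},
    "fam2": {"strain_A", "strain_B", "strain_C", "strain_D"},  # absent in E
    "fam3": {"strain_A", "strain_B"},                          # accessory family
    "fam4": {"strain_B", "strain_C", "strain_D", "strain_E"},  # absent in A
}
result = find_candidate_missing_orthologues(families, genomes)
print(result)
```

A flagged family is only a candidate. The gene may be truly absent from that genome, so each case would still need to be checked against the nucleotide sequence, for example with a similarity search, before the annotation is revised.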

Abbreviations: CDS, coding sequence(s); HMM, hidden Markov model

SGM

Microbiology 156, 1909 (2010); https://doi.org/10.1099/mic.0.033811-0
