Bakta: rapid and standardized annotation of bacterial genomes via alignment-free sequence identification Open Access

Abstract

Command-line annotation software tools have continuously gained popularity compared to centralized online services due to the worldwide increase of sequenced bacterial genomes. However, results of existing command-line software pipelines heavily depend on taxon-specific databases or sufficiently well annotated reference genomes. Here, we introduce Bakta, a new command-line software tool for the robust, taxon-independent, thorough and, nonetheless, fast annotation of bacterial genomes. Bakta conducts a comprehensive annotation workflow including the detection of small proteins taking into account replicon metadata. The annotation of coding sequences is accelerated via an alignment-free sequence identification approach that in addition facilitates the precise assignment of public database cross-references. Annotation results are exported in GFF3 and International Nucleotide Sequence Database Collaboration (INSDC)-compliant flat files, as well as comprehensive JSON files, facilitating automated downstream analysis. We compared Bakta to other rapid contemporary command-line annotation software tools in both targeted and taxonomically broad benchmarks including isolates and metagenomic-assembled genomes. We demonstrated that Bakta outperforms other tools in terms of functional annotations, the assignment of functional categories and database cross-references, whilst providing comparable wall-clock runtimes. Bakta is implemented in Python 3 and runs on macOS and Linux systems. It is freely available under a GPLv3 license at https://github.com/oschwengers/bakta. An accompanying web version is available at https://bakta.computational.bio.
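
The abstract does not spell out how alignment-free sequence identification works. One common way to realize the idea is an exact-match lookup keyed by a digest of the protein sequence, so that a predicted coding sequence is resolved to a known annotation and its database cross-references without computing any alignment. The following is a minimal sketch under that assumption. The function names, the toy reference table, the placeholder identifiers and the choice of MD5 as digest are all hypothetical and are not taken from the text above.

```python
import hashlib
from typing import Dict, Optional, Tuple

Key = Tuple[str, int]


def sequence_key(protein_seq: str) -> Key:
    """Build a lookup key from a protein sequence: (hex digest, length).

    The sequence is normalized (whitespace stripped, upper-cased, trailing
    stop symbol removed) so that trivially different spellings of the same
    protein map to the same key. MD5 is an assumption made for illustration.
    """
    seq = protein_seq.strip().upper().rstrip("*")
    digest = hashlib.md5(seq.encode("ascii")).hexdigest()
    return digest, len(seq)


def build_index(reference: Dict[str, dict]) -> Dict[Key, dict]:
    """Index reference proteins by their sequence key."""
    return {sequence_key(seq): annotation for seq, annotation in reference.items()}


def identify(protein_seq: str, index: Dict[Key, dict]) -> Optional[dict]:
    """Return the annotation of an exactly matching reference protein, or None."""
    return index.get(sequence_key(protein_seq))


# Toy reference table with made-up sequences and placeholder cross-references.
REFERENCE = {
    "MKTAYIAKQR": {"product": "example protein A", "xrefs": ["EXAMPLE:0001"]},
    "MSLNDERVWQ": {"product": "example protein B", "xrefs": ["EXAMPLE:0002"]},
}
INDEX = build_index(REFERENCE)

if __name__ == "__main__":
    print(identify("mktayiakqr*", INDEX))  # exact match after normalization
    print(identify("MKTAYIAKQA", INDEX))   # one residue differs, so no match
```

A lookup of this kind costs one hash computation and one dictionary access per sequence, which is why it can be much faster than an alignment. The trade-off is that only identical sequences are found, so a pipeline built this way would still need a slower homology search for the remaining proteins.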

Funding
This study was supported by the:
  • BMBF (Award 031A533)
    • Principal Award Recipient: Alexander Goesmann
  • BMBF (Award 031L0209A)
    • Principal Award Recipient: Alexander Goesmann
/content/journal/mgen/10.1099/mgen.0.000685
2021-11-05
2024-03-29