Abstract

Capturing the published corpus of information on all members of a given protein family should be an essential step in any study focusing on specific members of that family. Using a previously gathered dataset of more than 280 references mentioning a member of the DUF34 (NIF3/Ngg1-interacting Factor 3) family, we evaluated the efficiency of different databases and search tools, and devised a workflow that experimentalists can use to capture the most published information on members of a protein family in the least amount of time. To complement this workflow, web-based platforms for exploring protein family members across sequenced genomes or for analysing gene neighbourhood information were reviewed for their versatility and ease of use. Recommendations for experimentalists and educators are provided and integrated within a customized, publicly accessible Wiki.

Funding
This study was supported by the:
  • National Institute of General Medical Sciences (Award GM70641)
    • Principal Award Recipient: Valérie de Crécy-Lagard

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001183
2024-02-07
2024-05-20