Abstract

Fast, efficient public health action requires well-organized, coordinated systems that can supply timely and accurate knowledge. Public databases of pathogen genomic data, such as those of the International Nucleotide Sequence Database Collaboration (INSDC), have become essential tools for efficient public health decisions. However, these international resources were established primarily for academic purposes rather than for surveillance or interventions. Queries now need not only to access the whole genomes of multiple pathogens but also to make connections using robust contextual metadata in order to identify issues of public health relevance. Databases that have developed a patchwork of submission formats and requirements over time need to be consistently organized and coordinated internationally to allow effective searches.

To help resolve these issues, we propose a common pathogen data structure called the Pathogen Data Object Model (DOM) that will formalize the minimum pieces of sequence data and contextual data necessary for general public health uses, while recognizing that submitters will likely withhold a wide range of non-public contextual data. Further, we propose that contributors use the Pathogen DOM for all pathogen submissions (bacterial, viral, fungal, and parasitic), which will simplify data submission and provide a consistent and transparent data structure for downstream data analyses. We also highlight how improved submission tools can support the Pathogen DOM, offering users additional easy-to-use methods to ensure this structure is followed.

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001145
2023-12-12
2024-04-29
