
Abstract

As public health laboratories expand their genomic sequencing and bioinformatics capacity for the surveillance of different pathogens, labs must carry out robust validation, training, and optimization of wet- and dry-lab procedures. Achieving these goals for algorithms, pipelines and instruments often requires that lower quality datasets be made available for analysis and comparison alongside those of higher quality. This range of data quality in reference sets can complicate the sharing of sub-optimal datasets that are vital for the community and for the reproducibility of assays. Sharing useful but sub-optimal datasets requires careful annotation and documentation of known issues so that the data can be interpreted appropriately, are not mistaken for better quality information, and (together with their derivatives) are easily identifiable in repositories. Unfortunately, there are currently no standardized attributes or mechanisms for tagging poor-quality datasets, or datasets generated for a specific purpose, to maximize their utility, searchability, accessibility and reuse. The Public Health Alliance for Genomic Epidemiology (PHA4GE) is an international community of scientists from public health, industry and academia focused on improving the reproducibility, interoperability, portability, and openness of public health bioinformatic software, skills, tools and data. To address the challenges of sharing lower quality datasets, PHA4GE has developed a set of standardized contextual data tags, namely fields and terms, that can be included in public repository submissions as a means of flagging pathogen sequence data with known quality issues, increasing their discoverability.
The contextual data tags were developed through consultations with the community, including input from the International Nucleotide Sequence Database Collaboration (INSDC), and have been standardized using ontologies, which are community-based resources for defining the tag properties and the relationships between them. The standardized tags are agnostic to the organism and the sequencing technique used and thus can be applied to data generated from any pathogen using an array of sequencing techniques. The tags can also be applied to synthetic (lab-created) data. The list of standardized tags is maintained by PHA4GE and can be found at https://github.com/pha4ge/contextual_data_QC_tags. Definitions, ontology IDs, examples of use, as well as a JSON representation, are provided. The PHA4GE QC tags were tested, and are now implemented, by the FDA’s GenomeTrakr laboratory network as part of its routine submission process for SARS-CoV-2 wastewater surveillance. We hope that these simple, standardized tags will help improve communication regarding quality control in public repositories, in addition to making datasets of variable quality more easily identifiable. Suggestions for additional tags can be submitted to PHA4GE via the New Term Request Form in the GitHub repository. By providing a mechanism for feedback and suggestions, we also expect that the tags will evolve with the needs of the community.
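
As a rough illustration of how such tags might make flagged datasets easier to find, the sketch below attaches QC fields to two submission records and selects those with a recorded issue. The field names (`quality_control_determination`, `quality_control_issues`), the term values and the sample IDs are hypothetical placeholders, not the PHA4GE specification; the actual fields, terms and ontology IDs are listed in the repository above.

```python
import json

# Hypothetical field names and values, for illustration only. Consult the
# PHA4GE repository for the actual standardized fields, terms and ontology IDs.
records = [
    {
        "sample_id": "sample-001",
        "quality_control_determination": "passed",
        "quality_control_issues": [],
    },
    {
        "sample_id": "sample-002",
        "quality_control_determination": "failed",
        "quality_control_issues": ["low coverage"],
    },
]


def flagged(recs):
    """Return the records that carry at least one QC issue tag."""
    return [r for r in recs if r.get("quality_control_issues")]


def to_json(recs):
    """Serialize records to JSON, e.g. for inclusion alongside a submission."""
    return json.dumps(recs, indent=2, sort_keys=True)


if __name__ == "__main__":
    # List only the datasets that were tagged with a known quality issue.
    for r in flagged(records):
        print(r["sample_id"], r["quality_control_issues"])
```

Because the tags are ordinary key-value pairs, the same filter could in principle be applied to any repository export that preserves them, whatever the organism or sequencing technique.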

Funding
This study was supported by the:
  • Biotechnology and Biological Sciences Research Council
    • Principal Award Recipient: Bryan A Wee
  • National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health
    • Principal Award Recipient: Ilene Karsch-Mizrachi
  • MRC Centre for Global Infectious Disease Analysis (Award MR/R015600/1)
    • Principal Award Recipient: Leonid Chindelevitch
  • Public Health Agency of Canada (Award # 2223-HQ-000265)
    • Principal Award Recipient: Damion Dooley
  • Public Health Agency of Canada (Award # 2223-HQ-000265)
    • Principal Award Recipient: Rhiannon Cameron
  • Public Health Agency of Canada (Award # 2223-HQ-000265)
    • Principal Award Recipient: Emma J Griffiths

This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001260
2024-06-11
2024-06-14