
Abstract

During the COVID-19 pandemic, large-scale pathogen genomic sequencing became part of the toolbox for surveillance and epidemic research. This resulted in an unprecedented level of data sharing to open repositories, which has actively supported the identification of SARS-CoV-2 structure, molecular interactions, mutations and variants, and has facilitated vaccine development and drug reuse studies and design. The European COVID-19 Data Platform was launched to support this data sharing, and has resulted in the deposition of several million SARS-CoV-2 raw reads. In this paper we describe (1) open data sharing, (2) tools for submission, analysis, visualisation and data claiming (e.g. ORCID), (3) the systematic analysis of these datasets at scale via the SARS-CoV-2 Data Hubs, and (4) lessons learnt. The SARS-CoV-2 Data Hubs are a component of the Platform; they enable the extension and set-up of infrastructure that we intend to use more widely in the future for pathogen surveillance and pandemic preparedness.

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.001188
2024-02-15
2024-12-03
