
Abstract

There is a need to identify microbial sequences that may form part of transmission chains, or that may represent importations across national boundaries, amidst large numbers of SARS-CoV-2 and other viral or bacterial sequences. Reference-based compression is a sequence analysis technique that allows both compact storage of sequence data and comparisons between sequences. Published implementations of the approach are being challenged by the large sample collections now being generated. Our aim was to develop fast software that detects highly similar sequences in large collections of microbial genomes, including millions of SARS-CoV-2 genomes. To do so, we developed Catwalk, a tool that bypasses bottlenecks in the generation, comparison and in-memory storage of microbial genomes generated by reference mapping. It is a compiled solution, coded in Nim to increase performance, and can be accessed via command-line or web-server interfaces. We tested Catwalk using both SARS-CoV-2 and Mycobacterium tuberculosis genomes generated by prospective public-health sequencing programmes. Pairwise sequence comparisons, using clinically relevant similarity cut-offs, took about 0.39 and 0.66 μs, respectively; in 1 s, between 1 and 2 million sequences can be searched. Catwalk operates about 1700 times faster than, and uses about 8 % of the RAM of, a Python reference-based compression and comparison tool in current use for outbreak detection. Catwalk can rapidly identify close relatives of a SARS-CoV-2 or M. tuberculosis genome amidst millions of samples.
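The idea behind reference-based compression and comparison can be illustrated with a minimal Python sketch. This is not Catwalk's implementation (which is written in Nim); the function names `compress`, `snp_distance` and `neighbours`, the toy sequences, and the choice to ignore positions that are uncertain in either sequence are all assumptions made for illustration. Each reference-mapped sequence is stored only as the positions at which it differs from the reference, and the distance between two sequences is derived from the symmetric difference of those position sets.

```python
from typing import Dict, List, Set

BASES = "ACGT"

def compress(sequence: str, reference: str) -> Dict[str, Set[int]]:
    """Keep only the positions where a mapped sequence differs from the reference."""
    if len(sequence) != len(reference):
        raise ValueError("sequence must be reference-mapped (same length as reference)")
    diffs: Dict[str, Set[int]] = {key: set() for key in BASES + "N"}
    for pos, (s, r) in enumerate(zip(sequence, reference)):
        if s not in BASES:
            diffs["N"].add(pos)      # uncertain call: recorded separately (assumption)
        elif s != r:
            diffs[s].add(pos)        # variant base at this position
    return diffs

def snp_distance(a: Dict[str, Set[int]], b: Dict[str, Set[int]]) -> int:
    """Count positions where two compressed sequences carry different known bases."""
    unknown = a["N"] | b["N"]
    differing: Set[int] = set()
    for base in BASES:
        # Positions variant for this base in exactly one of the two sequences.
        differing |= a[base] ^ b[base]
    return len(differing - unknown)

def neighbours(query: Dict[str, Set[int]],
               collection: Dict[str, Dict[str, Set[int]]],
               cutoff: int) -> List[str]:
    """Return names of stored sequences within `cutoff` differences of the query."""
    return sorted(name for name, comp in collection.items()
                  if snp_distance(query, comp) <= cutoff)

# Toy data, hypothetical and for illustration only.
reference = "ACGTACGTAC"
collection = {
    "a": compress("ACGTTCGTAC", reference),  # one variant, at position 4
    "b": compress("ACGAACGTNC", reference),  # one variant at 3, uncertain call at 8
    "c": compress("ACGTTCGTAG", reference),  # variants at positions 4 and 9
}
close_to_a = neighbours(collection["a"], collection, cutoff=1)
```

Because only variant positions are stored, memory use grows with the number of differences from the reference rather than with genome length, and a comparison touches only those small sets.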

  • This is an open-access article distributed under the terms of the Creative Commons Attribution License. This article was made open access via a Publish and Read agreement between the Microbiology Society and the corresponding author’s institution.
/content/journal/mgen/10.1099/mgen.0.000850
2022-06-30
