Abstract

The wide adoption of bacterial genome sequencing, together with the encoding of both core and accessory genome variation as k-mers, has allowed bacterial genome-wide association studies (GWAS) to identify genetic variants associated with relevant phenotypes, such as those linked to infection. Two significant limitations remain: k-mers can be duplicated across gene clusters, and association results are difficult to interpret. Both hinder the wider adoption of GWAS methods on microbial data sets. We have developed a simple computational method (panfeed) that explicitly links each k-mer to its gene cluster at base-resolution level, which avoids the biases introduced by a global de Bruijn graph and makes it easier to map and annotate associated variants. We tested panfeed on two independent data sets and correctly identified previously characterized causal variants, which demonstrates the precision of the method as well as its scalable performance. panfeed is a command-line tool written in the Python programming language and is available at https://github.com/microbial-pangenomes-lab/panfeed.
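The idea of tying each k-mer to its gene cluster can be illustrated with a minimal Python sketch. This is not panfeed's implementation: the function names, the toy sequences, and the choice of k = 3 are all assumptions made for illustration. It only shows why keying presence/absence patterns by (gene cluster, k-mer) keeps a k-mer that occurs in two clusters as two separate variants, while also recording the base offset of each occurrence.

```python
from collections import defaultdict


def cluster_kmers(sequences, k):
    """Map each k-mer to the (strain, offset) positions where it occurs
    within a single gene cluster (hypothetical helper)."""
    positions = defaultdict(list)
    for strain, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            positions[seq[offset:offset + k]].append((strain, offset))
    return dict(positions)


def presence_patterns(clusters, k):
    """Build presence/absence patterns keyed by (cluster, k-mer), so the
    same k-mer found in two gene clusters stays two separate variants."""
    patterns = {}
    for cluster, sequences in clusters.items():
        for kmer, hits in cluster_kmers(sequences, k).items():
            patterns[(cluster, kmer)] = {strain for strain, _ in hits}
    return patterns


# Toy input (made up): gene cluster -> strain -> gene sequence
clusters = {
    "geneA": {"s1": "ATGCA", "s2": "ATGGA"},
    "geneB": {"s2": "CATGC"},
}
patterns = presence_patterns(clusters, k=3)
```

In this toy example the k-mer ATG occurs in both clusters. Counting it globally would yield a single pattern covering both strains, whereas the per-cluster keys keep the geneA pattern (both strains) apart from the geneB pattern (strain s2 only).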

Funding
This study was supported by the:
  • Deutsche Forschungsgemeinschaft (Award GA 3191/1-1)
    • Principal Award Recipient: Hannes Sommer
  • Deutsche Forschungsgemeinschaft (Award 390874280)
    • Principal Award Recipient: Marco Galardini
  • Deutsche Forschungsgemeinschaft (Award 390874280)
    • Principal Award Recipient: Dilfuza Djamalova
  • Deutsche Forschungsgemeinschaft (Award 390874280)
    • Principal Award Recipient: Hannes Sommer

This is an open-access article distributed under the terms of the Creative Commons Attribution License.
/content/journal/mgen/10.1099/mgen.0.001129
2023-11-07
2024-04-28
Supplements
  • Supplementary material 1 (PDF)
  • Supplementary material 2 (Excel)