Abstract

Genome-wide association studies (GWASs) have the potential to reveal the genetics of microbial phenotypes such as antibiotic resistance and virulence. Capitalizing on the growing wealth of bacterial sequence data, microbial GWAS methods aim to identify causal genetic variants while ignoring spurious associations. Bacteria reproduce clonally, leading to strong population structure and genome-wide linkage, which makes it challenging to separate true ‘hits’ (i.e. mutations that cause a phenotype) from non-causal linked mutations. GWAS methods attempt to correct for population structure in different ways, but their performance has not yet been systematically and comprehensively evaluated under a range of evolutionary scenarios. Here, we developed a bacterial GWAS simulator (BacGWASim) to generate bacterial genomes with varying rates of mutation, recombination and other evolutionary parameters, along with a subset of causal mutations underlying a phenotype of interest. We assessed the performance (recall and precision) of three widely used single-locus GWAS approaches (cluster-based, dimensionality-reduction and linear mixed models, implemented in PLINK, pyseer and GEMMA) and one relatively new multi-locus model implemented in pyseer, across a range of simulated sample sizes, recombination rates and causal mutation effect sizes. As expected, all methods performed better with larger sample sizes and effect sizes. The performance of the clustering and dimensionality-reduction approaches to correcting for population structure varied considerably with the choice of parameters. Notably, the multi-locus elastic net (lasso) approach was consistently amongst the highest-performing methods, and had the highest power in detecting causal variants with both low and high effect sizes. Most methods reached the level of good performance (recall >0.75) for identifying causal mutations of strong effect size [log odds ratio (OR) ≥2] with a sample size of 2000 genomes. However, only elastic nets reached the level of reasonable performance (recall=0.35) for detecting markers with weaker effects (log OR ~1) in smaller samples. Elastic nets also showed superior precision and recall in controlling for genome-wide linkage, relative to single-locus models. Nevertheless, all methods performed relatively poorly on highly clonal (low-recombining) genomes, suggesting room for improvement in method development. These findings show the potential for multi-locus models to improve bacterial GWAS performance. BacGWASim code and simulated data are publicly available to enable further comparisons and benchmarking of new methods.
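
The two performance measures named in the abstract can be illustrated with a minimal Python sketch. This is not BacGWASim code: the function name, the site positions and the scoring rule (a called site counts as a true positive only if it is exactly one of the simulated causal sites) are assumptions made for illustration, and the study's own scoring may differ.

```python
from typing import Iterable, Tuple


def precision_recall(called_hits: Iterable[int],
                     causal_sites: Iterable[int]) -> Tuple[float, float]:
    """Return (precision, recall) for a set of GWAS hits.

    Assumption: a hit is a true positive only when its position is
    exactly one of the simulated causal positions.
    """
    called = set(called_hits)
    causal = set(causal_sites)
    true_pos = len(called & causal)
    # Precision: fraction of reported hits that are truly causal.
    precision = true_pos / len(called) if called else 0.0
    # Recall: fraction of causal sites that were reported.
    recall = true_pos / len(causal) if causal else 0.0
    return precision, recall


# Hypothetical genome positions, for illustration only.
causal = [120, 4510, 9087, 15002]   # simulated causal mutations
called = [120, 4510, 7777]          # sites a method reported as significant

precision, recall = precision_recall(called, causal)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.67 recall=0.50
```

Under this scoring, a non-causal site that is reported only because it is linked to a causal one lowers precision without changing recall, which is why strong genome-wide linkage makes both measures worth tracking.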

Funding
This study was supported by:
  • B. Jesse Shapiro, Génome Québec (Award BCB)
  • B. Jesse Shapiro, Genome Canada (Award BCB)

/content/journal/mgen/10.1099/mgen.0.000337
2020-02-25
2020-06-04